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PROBABILISTIC  INDEXING 

A  Statistical  Technique  For  Document 
Identification  And  Retrieval 

(SUMMARY) 


In  August  of  last  year,  in  an  internal  paper  entitled  "Probabilistic 
Indexing;  A  Statistical  Approach  to  the  Library  Problem",  one  of 
the  authors  proposed  a  novel  approach  to  the  problem  of  indexing 
and  searching  for  documentary  data  in  a  mechanized  library.  1  By 
December  1958  sorne  preliminary  experiments  with  Probabilistic 
Indexing  had  been  exl^cuted  and  the  results  were  published  in  a 
Ramo-Wooldridge  re^rt^ entitled  "Some  Experiments  with  Prob¬ 
abilistic  Indexing".  ^  Thtf present  report  describes  recent 


experiments  that  were  made  to  evaluate  the  re¬ 


trieval  ef^ctiveness  of  new  method 
searching  In  addition  several  refinemq 
basic  notions  of  Probabilistic  Indexing  « 


new  method  of  literature  indexing  and 


and  exten^ons  of  the 


"Probabilistic  Indexing;  A  Statistical  Approach  to  the  Library 
Problem",  M.  E.  Maron,  August  1958. 


"Some  Experiments  with  Probabilistic  Indexing",  M.  E.  Maron, 
J.  L.  Kuhns  and  L.  C.  Ray,  December  1958. 
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PART  I. 


THE  PROBLEM  OF  INFORMATION 
IDENTIFICATION  AND  RETRIEVAL 

(SUMMARY) 


The  basic  function  of  a  library  computer  is  to  accept  as  inputs,  re¬ 
quests  for  information,  and  to  supply  as  outputs,  a  list  of  those 
documents  which  are  most  relevant  for  each  request.  In  conv'*"' 
tional  systems  the  information  content  of  each  document  is  ’  ,nti- 
fied  by  assigning  to  it  a  set  of  index  tags  and  the  search  co-  ists  of 
finding  those  documents  whose  tags  are  logically  compatible  with 
the  tags  of  a  request.  Because  there  is  no  precise  relationship  be¬ 
tween  the  tags  and  the  subjects  that  they  denote,  the  search  "strategy” 
which  consists  of  matching  essentially  noisy  tags  causes  the  class 
of  documents  selected  by  a  request  to  contain  irrelevant  documents 
and,  even  worse,  to  exclude  relevant  documents. 

The  technique  of  Probabilistic  Indexing  starts  with  the  recognition 
that  index  terms  are  noisy  and  then  introduces  the  mathematics  of 
uncertainty  (viz.  ,  the  calculus  of  probability)  in  order  to  compute 
a  probable  relevance  number  for  each  document  selected  by  a  re - 
quest.  Probablistic  Indexing  involves  the  assigning  of  weights  to 
the  index  terms  that  are  used  to  tag  the  documents  of  a  library. 

These  weights,  in  addition  to  statistical  data  concerning  the  library 
usage,  are  then  used  by  the  library  computer  so  that,  given  a  re - 
quest  for  information,  an  inverse  statistical  inference  can  be  made 
in  order  to  derive  a  number  (called  the  "relevance  number")  for 
each  document,  which  is  a  measure  of  the  relevance  of  the  document 
for  the  requestor.  The  result  of  a  search  is  an  ordered  list  of  those 
documents  which  satisfy  the  request  and  ranked  according  to  their 
relevance  number.  The  technique  of  Probabilistic  Indexing  is  ex¬ 
tended  so  that  a  request  may  be  elaborated  upon  automatically,  in 
the  most  probable  direction,  so  as  to  increase  the  probability  of 
selecting  relevant  documents,  while  the  use  of  the  computed  rele¬ 
vance  numbers  allows  irrelevant  documents  to  be  rejected. 
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1.  INTRODUCTION 


1.  1  Initial  Remarks  About  Information  Retrieval 

In  recent  years  there  has  been  increased  attention  given  to  the  problem 
of  designing,  building  and  using  an  automatic  library  system  which  can 
accept  and  store  large  amounts  of  documentary  data  (as  for  example, 
that  contained  in  books,  journals  and  pamphlets  of  all  sorts)  so  that  the 
information  may  be  retrieved  rapidly  upon  subsequent  request.  The  re¬ 
quest  for  information  might  concern  a  single  rather  specific  item  of 
data  or  it  might  concern  a  broad  class  of  information  relevant  to  some 
desired  subject  matter.  Regardless  of  the  exact  category  that  we  might 
consider,  it  is  quite  clear  that  the  importance  of  the  over-all  problem 
of  information  retrieval  lies  in  the  fact  that  information  is  the  primary 
nutrient  without  which  science,  government,  industry  (and  society  itself) 
cannot  thrive  and  we  are  set  back  to  the  extent  that  valuable  information 
becomes  inaccessible  in  our  libraries. 

1 .  2  Limitations  and  Extensions  of  Retrieval  Systems 

In  what  follows  we  have  confined  our  attention  solely  to  the  considera¬ 
tion  of  an  information  retrieval  system.  Once  the  basic  conceptual 
problems  of  information  identification,  storage  and  retrieval  have  been 
successfully  managed  one  can  turn  attention  to  problems  concerned  with 
extending  the  range  of  automatic  information  handling.  That  is  to  say, 
when  dealing  with  documentary  data  (expressed  in  ordinary  language) 
one  might  like  not  only  to  store  and  retrieve,  but,  in  addition,  to  per¬ 
form  the  following  operations  on  the  information:  automatic  analysis 
to  detect  and  remove  redundant  information,  automatic  abstracting  of 
relevant  information,  automatic  verification  of  information  (i.e.  ,  given 
some  items  of  data,  decide  whether  or  not  they  are  inconsistent  with 
any  other  data  already  in  storage),  automatic  deduction  (i.  e.  ,  logical 
derivation),  automatic  correlation  of  data  so  as  to  establish  trends  and 
deviations  .  om  trends,  and  so  on.  It  appears  that  as  a  first  step  in  the 
direction  of  general  purpose  information  handling,  as  typified  by  the 
above  examples,  the  problem  of  information  identification  and  retrieval 
must  be  met  and  dealt  with  successfully. 
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1.3  A  Major  Difficulty 


There  are  a  number  of  obvious  difficulties  associated  with  the  so-called 
"library  problem"  (i.e.  ,  the  problem  of  information  search  and  retrieval) 
and  the  one  usually  cited  relates  to  the  fact  that  documentary  data  are 
being  generated  at  an  alarming  rate  (the  growth  rate  is  exponential  -- 
doubling  every  12  years  for  some  libraries)  and  consequently,  considera¬ 
tions  of  volume  alone  make  the  problem  appear  frightening.  However, 
the  heart  of  the  problem  does  not  concern  size,  but,  rather,  it  concerns 
meaning .  That  is  to  say,  the  major  difficulties  associated  with  the  li¬ 
brary  problem  concern  the  identification  of  information  content- -the 
problem  of  determining  of  two  items  of  data  which  is  "closer"  in  mean¬ 
ing  to  a  third  item --the  problem  of  determining  whether  or  not  (or  to 
what  degree)  some  document  is  relevant  to  a  given  request,  etc.  In  or¬ 
dinary  language  there  are  no  rules  which  prescribe  how  words  should 
be  selected  and  combined  in  order  to  express  various  kinds  and  shades 
of  meanings.  It  is  because  ordinary  language  is  vague  and  ambiguous 
and  because  there  are  no  rules  which  allow  us  to  manipulate  the  infor¬ 
mation  on  the  basis  of  its  meaning  that  the  problem  is  so  complex.  This 
then  is  the  heart  of  the  problem  and  we  shall  have  more  to  say  about  it 
subsequently. 

1.4  Levels  of  Consideration 

The  problem  of  an  automatic  library  can  be  examined  at  several  levels 
ranging  from  the  equipment  frame  of  reference  to  a  basic  information 
flow  perspective.  There  have  been  a  number  of  "hardware"  solutions 
to  the  problem  of  library  size  (e,  g.  ,  use  of  microfilm,  microcards, 
minicards,  etc.  )  but  since  the  major  problem  is  logico-linguistic,  we 
shall  cast  the  problem  on  the  conceptual  level.  Thus,  we  propose  to 
make  an  analysis  of  the  logic  of  the  problem,  to  describe  a  technique 
for  dealing  with  the  problem,  to  present  some  logical  and  experimental 
data  to  support  our  technique  and  to  lay  aside,  at  least  for  the  present, 
any  considerations  dealing  with  the  physical  implementation  of  the 
technique . 
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2. 


THE  CONVENTIONAL  APPROACH  TO  AN  AUTOMATIC  RETRIEVAL 
SYSTEM 


2.  1  The  Role  of  Indexes 

Because,  at  least  for  the  immediate  future,  no  machine  can  actually 
read  a  document  and  decide  whether  or  not  its  subject  matter  relates  to 
some  given  request  subject,  it  is  necessary  to  use  some  intermediate 
identifying  tags;  namely,an  indexing  system.  An  index  to  a  document 
acts  as  a  tag  by  means  of  which  the  information  content  of  the  document 
in  question  may  be  identified.  The  index  may  be  a  single  term  or  a  set 
of  terms  which  together  tag  or  identify  the  content  of  each  document. 

The  class  of  terms  (whether  it  be  a  classification  indexing  system,  co¬ 
ordinate  indexing,  etc.)  which  constitutes  the  allowable  vocabulary  for 
indexing  documents  in  a  library  is  the  common  language  which  bridges 
the  gap  between  the  information  in  the  documents  and  the  information 
requirements  of  the  users. 

2.  2  The  Assignment  of  Indexes 

In  principle,  an  indexer  reads  an  incoming  document  and  then  selects 
one  or  several  of  the  index  terms  from  the  "library  vocabulary"  and  he 
coordinates  the  selected  terms  with  the  given  document  (or  its  acces¬ 
sion  number).  Thus,  the  assignment  of  terms  to  each  document  is  a  go 
or  no-go  affair--for  each  term  either  it  applies  to  the  document  in  ques¬ 
tion  or  it  does  not.  Furthermore,  the  process  of  indexing  information 
and  that  of  formulating  a  request  for  information  are  symmetrical  in  the 
sense  that  just  as  the  subject  content  of  a  document  is  identified  by  co¬ 
ordinating  to  it  a  set  of  index  terms,  so  also,  the  subject  content  of  a 
request  must  be  identified  by  coordinating  to  it  a  set  of  index  terms. 
Thus,  the  user  who  has  some  particular  information  need  identifies  this 
need  in  terms  of  a  library  request  consisting  of  one  or  several  index 
terms  or  logical  combinations  thereof. 
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2.  3  The  Notion  of  Semantic  Noise 

The  correspondence  between  the  information  content  of  a  document  and 
its  set  of  indexes  is  not  exact  because  it  is  extremely  difficult  to  specify 
precisely  the  subject  content  of  a  document  by  means  of  one  or  several 
index  words.  If  we  consider  the  set  of  all  index  terms,  on  the  one  hand, 
and  the  class  of  subjects  that  they  denote,  on  the  other  hand,  then  we 
see  that  there  is  no  strict  one-one  correspondence  between  the  two.  It 
turns  out  that  given  any  term  there  are  many  possible  subjects  that  it 
could  denote  (to  a  greater  or  lesser  extent)  and  conversely,  any  parti¬ 
cular  subject  of  knowledge  (whether  broad  or  narrow)  usually  can  be 
denoted  by  a  number  of  different  terms.  This  situation  may  be  charac¬ 
terized  by  saying  that  there  is  "semantic  noise"  in  the  index  terms. 

Just  as  the  correspondence  between  the  information  content  of  a  docu¬ 
ment  and  its  set  of  indexes  is  not  exact,  so  also  the  correspondence  be¬ 
tween  a  user's  request,  as  formulated  in  terms  of  one  or  many  index 
words,  and  his  real  need  (intention)  is  not  exact.  Thus,  there  is  se¬ 
mantic  noise  in  both  the  document  indexes  and  in  the  requests  for 
information. 

One  of  the  reasons  that  the  index  terms  are  noisy  is  due  to  the  fact  that 
the  meanings  of  these  terms  are  a  function  of  their  setting.  That  is  to 
say,  the  meaning  of  a  term  in  isolation  is  often  quite  different  when  it 
appears  in  an  environment  (sentence,  paragraph,  etc.)  of  other  words. 
The  position  and  frequency  of  other  words  help  to  clarify  and  specify 
the  meanings  of  a  given  term.  Furthermore,  individual  word  meanings 
vary  from  person  to  person  because,  to  a  large  degree,  the  meanings 
of  the  words  are  a  matter  of  individual  experience.  This  is  all  to  say 
that  when  words  are  isolated  and  used  as  tags  to  index  documents,  it 
is  difficult  to  pin  down  their  meanings,  and,  consequently  it  is  difficult  to 
use  them  as  such  to  accurately  index  documents  or  to  accurately  specify 
a  request. 


5 


2 . 4  Conventional  Stopgaps 

Many  workers  in  the  field  of  library  science  have  attempted  to  reduce 
the  semamtic  noise  in  indexing  by  developing  specialized  indexing  sys¬ 
tems  for  different  kinds  of  libraries.  An  indexing  system  tailored  to  a 
particular  library  would  be  less  noisy  than  would  be  the  case  otherwise. 
(In  a  sense,  to  tailor  an  index  system  to  a  specific  library  is  to  apply 
the  principle  of  an  ideoglossary,  as  it  is  used  in  machine  language  trans¬ 
lation,  to  remove  semantic  ambiguity.  )  In  spite  of  careful  work  in  the 
developing  of  a  "best”  set  of  tags  for  a  particular  library,  the  problem 
of  semantic  noise  and  its  consequences  remain,  albeit  to  a  lesser  extent. 

Another  attempt  to  remove  the  semantic  noise  in  request  formulations 
has  to  do  with  the  use  of  logical  combinations  of  index  terms.  That  is 
to  say,  if  two  or  more  index  terms  are  joined  conjunctively,  it  helps 
to  narrow  or  more  nearly  specify  a  subject.  On  the  other  hand,  the  same 
set  of  terms  connected  disjunctively  broadens  the  scope  of  a  request.  ^ 
Thus,  using  logical  combinations  of  index  terms  one  would  hope  to  either 
avoid  the  retrieval  of  irrelevant  material  or  avoid  missing  relevant 
material.  However,  although  a  request  using  a  set  of  index  terms 
joined  conjunctively  does  decrease  the  probability  of  obtaining  irrele¬ 
vant  material,  it  also  increases  the  probability  of  missing  relevant 
material.  The  converse  holds  for  a  request  consisting  of  a  disjunction 
of  index  terms. 

2.  5  The  Selection  Function  and  Some  Consequences 

We  have  said  that  documents  are  indexed  by  assigning  one  or  several 
index  terms  to  each,  and,  similarly,  a  library  request  for  information 
is  formulated  by  selecting  one  or  several  of  those  index  terms  which 
most  closely  denote  the  desired  information  need.  Given  a  request,  the 


Strictly  speaking,  the  terms  "intersection"  and  "union"  should  be  used 
instead  of  "ccmjunction"  and  "disjunction",  respectively,  since  we  are 
referring  to  classes  and  not  propositions. 
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next  step  is  to  search  and  select  all  those  documents  (or  their  accession 
numbers)  whose  sets  of  index  terms  are  logically  compatible  with  those 
of  the  request.  Thus,  conventional  machine  searching  consists  of 
matching  the  indexes  and  the  requests  exactly.  The  actual  matching 
procedure  is  a  go  or  no-go  affair--a  set  of  index  terms  (associated  with 
a  particular  document)  either  satisfies  a  request  ,or  it  does  not. 

The  fact  that  conventional  selecting  (searching)  consists  in  deciding 
whether  an  exact  logical  match  exists  between  classes  of  essentially 
noisy  tags  implies  that  the  result  of  a  search  does  not  provide  an  opti¬ 
mal  list  of  documents.  The  fact  that  conventional  searching  consists 
in  matching  noisy  tags  implies  that  the  result  of  a  search  provides  docu¬ 
ments  which  are  irrelevant  to  the  real  needs  of  the  requestor,  and,  even 
worse,  some  of  the  really  relevant  documents  are  not  retrieved.  If  one 
broadens  a  request  (by  using  more  general  terms)  so  as  to  reduce  the 
probability  of  missing  a  relevant  document,  he  increases  the  probability 
of  obtaining  irrelevant  material.  Conversely,  if  he  narrows  his  request 
(by  using  rather  specific  terms)  in  order  to  avoid  irrelevant  material, 
he  increases  the  probability  of  missing  relevant  information.  This  un¬ 
desirable  situation  is  not  helped  by  the  fact  that  the  list  of  documents 
which  results  from  a  search  appears  in  a  random  order;  i.e.  ,  there  is 
no  hint  given  to  the  requestor  that  some  of  the  documents  that  have  been 
selected  are  less  relevant  to  the  request  than  others. 

In  the  following  section  we  shall  present  the  basic  notions  of  the  technique 
of  Probabilistic  Indexing  and  show  that  this  approach  to  the  library  prob¬ 
lem  improves  retrieval  effectiveness  both  by  reducing  the  probability  of 
obtaining  irrelevant  documents  and  by  increasing  the  probability  of  sel¬ 
ecting  relevant  documents.  Furthermore,  the  technique  of  Probabilistic 
Indexing  provides  as  the  result  of  a  search  an  ordered  list  of  those  docu¬ 
ments  which  satisfy  the  request,  ranked  according  to  relevance. 
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3.  THE  BASIC  NOTIONS  OF  PROBABILISTIC  INDEXING 


3 .  1  The  Probabilistic  Nature  of  the  Problem 

To  say  that  index  tags  are  noisy  is  to  say  that  there  is  an  uncertainty 
about  the  relationship  between  the  ternas  and  the  subjects  denoted  by 
the  terms.  That  is  to  say,  given  a  document  indexed  with  its  assigned 
index  term  (or  terms),  there  is  only  a  probability  that  if  a  user  is  inter¬ 
ested  in  the  subject  (or  subjects)  designated  by  the  tag,  he  will  find 
that  the  document  in  question  is  relevant.  This  situation  is  analogous 
to  the  case  when  a  message  is  selected  and  transmitted  over  an  elec¬ 
trical  communication  channel  which  is  noisy,  and,  as  a  result,  there 
is  only  a  probability  that  the  selected  message  will  be  received  at  the 
other  end  of  the  channel.  Thus,  given  any  arbitrary  received  message 
there  is  a  distribution  which  describes  the  probability  that  it  (i.  e.  ,  the 
received  message)  resulted  from  each  of  the  possible  transmitted  mes¬ 
sages.  Communication  theory  tells  us  that  the  ideal  receiver  is  one 
that  makes  an  inverse  inference  and  computes,  given  the  received  mes¬ 
sage,  the  most  probable  message  that  was  transmitted. 

Again,  one  may  consider  by  analogy  that  the  documents  of  a  library  are 
the  messages  that  are  selected  for  transmission,  that  the  indexer  is  the 
noisy  channel,  and  that  the  index  terms  are  the  messages  that  are  re¬ 
ceived  after  passing  through  the  channel.  By  analogy,  the  ideal  search¬ 
ing  systemis  the  one  that  makes  an  inverse  inference  and  computes, 
given  the  index  terms  of  a  request,  the  most  probable  document  that  is 
relevant  to  the  request.  Given  this  analogy  between  searching  a  library 
of  documents  and  communicating  in  the  presence  of  noise  we  see  that 
the  real  problem  is  to  introduce  the  proper  probabilities  so  that  the 
necessary  inverse  statistical  inference  can  be  computed. 

3.  2  The  Notion  of  Weight'ng  Index  Terms 

We  have  suggested  that  the  ideal  search  system  is  one  that  computes  the 
distribution  which  describes  the  probability  that  a  document  will  satisfy 
a  requestor.  This  means  that  given  a  request,  a  class  of  documents  is 
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selected  (namely  those  whose  index  terms  are  logically  compatible  with 
the  terms  and  logic  of  the  request)  and  for  each  document  in  this  class, 
the  system  will  have  to  compute  a  number,  called  the  "relevance  num¬ 
ber"  which  will  be  a  measure  of  the  expected  degree  of  relevance  of  the 
document  for  the  requestor.  How  should  such  a  relevance  number  be 
derived?  Surely,  it  should  be  a  function  of  the  probability  that  if  a  per¬ 
son  is  interested  in  the  content  of  a  given  document  then  he  will  use  the 
tag  (or  tags)  associated  with  the  document  in  requesting  information  on 
that  subject.  How  to  estimate  this  probability? 

As  we  have  stated  previously,  conventional  indexing  consists  in  having 
an  indexer  decide  on  a  yes-no  basis  whether  or  not  a  given  term  applies 
for  a  particular  document.  Either  a  tag  is  applicable  to  the  document 
or  it  is  not--there  is  no  middle  ground.  It  is  much  more  reasonable 
and  realistic  to  make  this  judgment  on  a  probabilistic  basis;  i.  e.  ,  to 
assert  that  a  given  tag  may  hold  with  a  certain  probability  or  weight. 
Properly  scaled  this  weight  can  be  used  as  an  estimate  of  the  above 
probability;  viz.  ,  the  probability  that  if  an  individual  desires  informa¬ 
tion  of  the  type  contained  in  the  document  then  he  will  use  the  tag  in 
question  in  requesting  that  information.  The  details  are  given  in 
Part  II,  1. 

Given  the  ability  to  weight  index  terms,  one  can  characterize  more  pre¬ 
cisely  the  information  content  of  a  document.  The  indexer  may  wish  to 
assign  a  low  weight  such  as  0.  1  or  0.  2  to  a  term  rather  than  to  say  that 
the  term  does  not  hold  at  all  for  the  document.  Conversely,  the  indexer 
may  wish  to  assign  a  weight  of  0.  8  or  0.  9  to  a  term  rather  than  to  say 
that  it  definitely  holds  for  a  document.  Thus,  given  weighted  (probabil¬ 
istic)  indexing  it  is  possible  to  more  accurately  characterize  the  infor¬ 
mation  content  of  a  document.  The  notion  of  weighting  the  index  terms 
that  are  assigned  to  documents  and  using  these  weights  to  compute  rele¬ 
vance  numbers  is  basic  to  the  technique  which  we  call  "Probabilistic 
Indexing" . 
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3.  3  The  A  Priori  Probability  Distribution 


One  of  the  major  goals  of  the  method  of  Probabilistic  Indexing  is  to 
compute  a  relevance  number  for  each  document  on  the  basis  of  a  given 
request.  The  retrieved  documents  will  be  ordered  according  to  their 
relevance  numbers  and  hence  the  outcome  of  a  search  will  be  a  list  of 
those  documents  whose  index  terms  satisfy  the  request;  the  documents 
will  be  ranked  according  to  the  probability  of  satisfying  the  request, 
thereby  providing  the  user  with  an  optimal  search  strategy  in  reading 
the  retrieved  information. 

We  have  indicated  that  the  relevance  number  of  some  document  should 
be  a  function  of  the  probability  that  if  an  individual  desires  information 
of  the  kind  contained  in  D^,  he  will  use  the  tags  associated  with  in  his 
request  for  information.  We  have  indicated  further  that  this  probability 
can  be  estimated  by  an  indexer,  and  in  fact,  the  weight  of  a  tag  (i.  e.  , 
the  degree  with  which  it  holds  for  a  document)  is,  when  properly  scaled, 
an  estimate  of  the  above  probability.  (This  will  be  discussed  more 
completely  in  Part  II,  1  with  an  explanation  of  how  the  initial  estimates 
can  be  modified  so  as  to  approach  the  correct  probability.  )  In  addition 
we  assert  that  the  relevance  number  should  also  be  a  function  of  the 
a  priori  probability  distribution  of  document  usage.  The  a  priori  prob¬ 
ability  distribution  of  usage  of  documents  in  a  sense  describes  the  pop¬ 
ularity  of  documents  in  a  library.  The  justification  for  including  the 
statistics  on  the  a  priori  probability  of  a  document  in  the  computation 
for  relevance  number  will  be  given  in  Part  II,  1  also. 

3 . 4  The  Schema  for  Computing  the  Relevance  Number 

Although  the  details  of  the  logical  and  mathematical  justification  of 
Probabilistic  Indexing  are  presented  in  Part  II  we  briefly  summarize 
the  theoretical  motivation  behind  our  procedure,  for  the  reader's  con¬ 
venience.  Given  the  a  priori  probability  distribution  of  usage  of  docu¬ 
ments  and  the  statistical  indexing  information  for  each  document,  the 
actual  search  would  involve  an  inverse  probability  calculation,  so-called 
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Bayes'  Theorem.  This  inverse  probability  calculation  computes  the 
probability  that  a  document  satisfies  the  request.  The  situation  can 
be  presented  schematically  as  follows: 


P(A,  D^) 

the  a  priori  probability  that  the  i^^ 
document  will  be  retrieved. 

P(A.  D. ,  1.) 

1  J 

til 

the  weight  with  which  the  i  docu¬ 
ment  is  indexed  with  the  index 

term. 

P(A.  I.,  D.) 

J  1 

til 

the  probability  that,  if  the  j  index 
term  is  requested,  the  i'-^  document 
will  satisfy  the  request 

P(A.  I.,  D.} 
j’  y 

rr 

P^A,  D  )  •  P(A.  D.  ,1.) 

1  \  j/ 

P(A.iy 


Thus  the  inverse  probability  calculation  will  be  made  for  each  document 
which  is  indexed  with  the  index  word.  For  each  of  those  a  number  will 
be  computed  which  will  be  a  function  of  both  the  degree  to  which  the 
document  is  indexed  by  the  given  index  term  and  also  the  relative  fre¬ 
quency  of  usage  of  the  document.  Once  these  computations  have  been 
made  these  numbers  and  the  associated  document  accession  numbers 
will  be  sorted  so  that  the  document  which  has  the  highest  probability 
of  satisfying  the  request  will  appear  first  on  the  list,  and  that  document 
with  the  lowest  probability  of  satisfying  a  request  will  be  last  on  the 
list. 

3 . 5  Request  Weights 

Just  an  indexer  rriay  coordinate  a  weight  to  an  index  term  (to  indicate 
the  degree  that  the  tag  in  question  holes  for  a  given  document),  so  also, 
the  library  system  should  allow  a  user  to  coordinate  weights  to  those 
index  terms  that  he  uses  in  formulating  his  reque  st  for  information. 

Just  as  weighted  index  tags  allow  the  indexer  to  characterize  the  infor¬ 
mation  content  of  a  document  more  precisely,  so  also,  weighted  request 
tags  provide  additional  precision  to  the  formulation  of  a  library  request. 
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DIAGRAM  1:  A  SCHEMA  FOR  BAYES'  THEOREM. 


Those  subjects  which  are  most  important  to  a  user's  needs  will  have 
high  weights  coordinated  with  their  tags  and  conversely.  The  methods 
of  Probabilistic  Indexing  provide  rules  describing  how  request  weights 
are  to  be  used  in  the  extended  computation  for  relevance  number. 

3 .  6  The  Automatic  Elaboration  of  a  Request 

Roughly  speaking,  we  can  say  that  the  set  of  index  terms  (and  their 
weights)  identify  the  information  content  of  each  document  with  which 
it  is  coordinated  and,  likewise,  the  request  formulated  in  terms  of 
weighted  index  tags  and  associated  logic  identifies  theuser's  informa¬ 
tion  need.  The  next  step  in  automatic  retrieval  is  to  match  identifica¬ 
tions  in  order  to  determine  which  documents  are  to  be  selected,  re¬ 
trieved  and  given  to  the  user.  Given  the  class  of  selected  documents, 
the  computation  of  relevance  number  allows  a  library  system  to  rank 
the  documents  according  to  their  probable  relevance  to  the  requestor.  It 
is  clear  however,  that  if  the  initial  request  is  inadequately  (or  incom¬ 
pletely)  formulated,  then  the  class  of  selected  documents  will  not  be 
optimal  and  no  amount  of  ranking  by  relevance  will  correct  this  difficulty- 
As  a  remedy  for  this  situation.  Probabilistic  Indexing  includes  methods 
for  automatically  elaborating  upon  any  arbitrary  request  so  as  to  im¬ 
prove  its  selectivity.  That  is  to  say,  included  among  the  methods  of 
Probabilistic  Indexing  are  mechanical  rules  for  automatically  relating 
index  terms  (and  documents)  so  that  given  a  request  for  a  particular 
set  of  index  terms  a  computer  can  determine  what  other  terms  are 
most  closely  related  to  the  request  and  thereby  automatically  elaborate 
upon  it  in  the  most  probable  direction,  in  order  to  improve  the  selection. 
The  rules  involve  the  derivation  of  probabilistic  weighting  factors  be¬ 
tween  index  terms  and  a  number  of  machine  "strategies"  for  deciding  how 
to  go  from  a  given  request  to  its  proper  elaboration. 
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PART  II. 


A  THEORETICAL  DISCUSSION 

(SUMMARY) 


The  conceptual  framework  of  the  library  which  grows  out  of  the  basic 
notions  of  Probabilistic  Indexing  allows  us  to  divide  the  over-all  prob¬ 
lem  of  information  searching  and  retrieval  into  two  parts.  The  first 
part  relates  to  the  problem  of  selecting  an  optimal  class  of  documents 
from  the  entire  library  in  order  to  satisfy  a  given  request  for  infor¬ 
mation.  We  call  this  the  selection  problem.  Once  a  class  has  been 
selected  each  document  in  it  is  ranked  according  to  its  probable  rele¬ 
vance.  We  call  this  the  problem  of  the  relevance  number. 

Part  II  of  this  report  contains  the  logico-mathematical  analysis,  ex¬ 
planation  and  (a  priori)  justification  of  the  methods  of  Probabilistic  In¬ 
dexing.  In  particular,  section  1  discusses  the  notion  of  probable  rele¬ 
vance  number  and  provides  the  detailed  explication  of  this  notion  in 
terms  of  the  theorem  of  Bayes.  We  discuss  the  meaning  of  index  and 
request  weights  and  the  computational  rules  which  allow  the  relevance 
number  to  be  computed.  Section  2  provides  the  discussion  of  how  an 
arbitrary  request  may  be  automatically  elaborated  upon  in  its  most 
probable  direction,  in  order  to  improve  the  selection  of  documents. 

We  describe  various  statistical  measures  for  determining  the  "close¬ 
ness"  between  the  terms  that  constitute  index  space  and  indicate  how 
these  measures  can  be  used  to  elaborate  on  a  request  in  order  to  pro¬ 
duce  an  optimal  selection  of  documents. 
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1. 


THE  DERIVATION  OF  THE  RELEVANCE  NUMBER 


1.  I  Initial  Remarks 

We  can  clarify  the  library  problem  by  considering  the  following  two 
fundamental  questions:  (1)  Given  the  class  of  documents  that  satisfy 
the  logic  of  a  request  which  of  these  is  most  probably  relevant  to  the  re¬ 
questor,  next  most  probably  relevant,  etc?  (2)  Interpreting  a  request 
as  giving  clues  to  the  real  information  requirements  of  the  requestors 
how  can  the  request  be  elaborated  in  order  to  improve  the  class  of  re¬ 
trieved  documents;  or,  more  generally,  how  can  the  document  selec¬ 
tion  process  be  improved?  In  this  section  we  will  discuss  the  first 
question.  In  particular  we  will  establish  a  measure  of  probable  rele¬ 
vance  (the  relevance  number)  and  show  how  this  quantity  can  be  com¬ 
puted.  Our  methods  in  establishing  computational  procedures  involve 
_a  priori  considerations  as  well  as  experimental  testing.  The  a  priori 
considerations  play  their  role  in  the  choice  of  schemata  from  the 
theory  of  probability  as  models  for  our  procedures  and  in  the  statistical 
modification  of  various  quantities  which  have  been  estimated  initially. 
The  problem  of  justification  of  these  procedures  can  be  considered  from 
two  aspects:  (1)  the  experimental  testing;  (2)  a  theoretical  development. 
Since  success  is  the  only  criterion  upon  which  a  retrieval  system  should 
be  judged,  we  see  that  a  theoretical  development  is  unnecessary;  never¬ 
theless  it  is  not  superfluous,  for  it  enables  us  to  isolate  assumptions 
and  points  the  way  for  possible  refinements  in  the  procedure.  The  fol¬ 
lowing  sections  (1.2-  1.  14)  present  a  theoretical  development  of  the 
computational  procedure  for  requests  formulated  as  Boolean  functions 
of  the  index  terms. 

1 .  2  Notational  Matters 

By  "P(A,B)''  we  mean  the  probability  of  an  event  of  class  B  occurring 
with  reference  to  an  event  of  class  A.  We  shall  be  interested  in  the 
following  classes  of  events: 

a.  :  obtaining  the  i^^  document  and  finding  it  relevant. 
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I.:  requesting  information  on  the  field  of  interest  (subject, 

■1  th 

area  of  knowledge)  designated  by  the  j  index  term  L .  (We 
use  the  same  symbol  for  the  event  class  and  the  index  term, 
but  the  proper  meaning  will  be  clear  in  context.  ) 

A;  requesting  information  from  the  library. 


th 

We  use"w..''  (also  "w.(I.)")  to  denote  "the  degree  to  which  the  j'"*  index 
ij  th.1  J  ® 

term  applies  to  the  i  document'.' Note  that  the  values  w.  .  define  a  matrix 

th 

called  the  "probabilistic  matrix",  where  the  entry  in  the  i  row  and  the 
j  column  is  the  weight  . 


1.  3  The  A  Priori  Probability  Distribution-  -A  Remark  about  the  Class  A 

We  call  "P(A,  D^)"  the  a  priori  probability  of  the  document  D^.  Although 
this  probability  arises  in  the  inverse  probability  calculation,  to  be  dis¬ 
cussed  below,  we  prefer  to  introduce  it  on  a  more  intuitive  level  as  an 
essential  ingredient  of  Probabilistic  Indexing.  In  a  literature  search, 
if  two  documents  are  indexed  identically  then  the  document  with  the 
greater  a  priori  probability  of  being  relevant  should  be  read  first.  This 
is  the  statistical  analogue  of  "recommendation"  of  texts.  The  calcula¬ 
tion  of  P(A,Dj^)  is  obtained  by  the  processing  of  library  statistics.  How 
this  is  done  will  be  discussed  in  section  1.  13,  but  for  the  present  we 
are  concerned  with  only  one  restrictive  condition.  This  condition  re¬ 
quires  a  qualification  of  the  class  A,  For  convenience  in  tabulating  li¬ 
brary  statistics  we  will  not  consider  A  to  be  the  class  of  ail  requests 
but  only  those  that  yield  a  document  relevant  to  the  requestor.  We  de¬ 
fine  the  termination  of  such  an  event  to  be  when  a  relevant  document  is 
obtained.  Thus  an  event  of  class  A  will  be  followed  by  one  and  only  one 
event  of  class  D^;  if  a  request  R  produces  two  relevant  documents  we 
regard  this  as  two  instances  of  the  request,  etc.  The  assumption  that 
a  request  will  produce  a  relevant  document  we  can  call  "the  axiom  of 
completeness  of  the  library  with  respect  to  the  index  terms".  In  par¬ 
ticular,  this  assumption  allows  us  to  normalize  the  a  priori  probability 
distribution.  Hence  we  assert 

^P(A,  D.)=l.  (1) 

i=l 
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1 . 4  Statement  of  the  Problem 

Given  a  request  R  we  want  to  rank  the  library  documents  according  to 

their  probable  relevance  to  the  requestor.  A  function  f  on  a  set  Dj^, 

D,,  .  .  .  ,  D  can  be  used  to  rank  the  set  by  means  of  the  function  values 
c  n 


Thus  we  want  to  look  for  functions  which  somehow  measure  the  prob¬ 
able  relevance  of  a  document.  If  this  is  accomplished  then  a  library 
search  can  be  represented  as  a  transfer  function  from  requests  (the 
input  space)  to  a  space  of  functions  of  the  variable  i(i  =  l,  .  .  .  ,  n)  rep¬ 
resenting  the  accession  number  of  a  document.  That  is  to  say,  a 
search  by  Probabilistic  Indexing  will  lead  to  a  function;  the  values  of 
this  function  give  the  probable  relevances  of  the  documents.  (Those 
i's  for  which  this  function  is  not  zero  give  the  accession  numbers  of 
the  documents  that  match  the  logic  of  the  request.) 


i .  5  The  Probable  Relevance  Function 

We  examine  first  the  simplest  type  of  request;  viz .  ,  R.  One  measure 
of  probable  relevance  is  given  by  the  function 

P(A.R.D.). 


because  this  is  the  probability  that  a  library  user,  making  the  request 
I.,  will  find  the  i''^  document  relevant.  We  call  this  the  probable  rele- 
vance  function.  Now,  keeping  I.  fixed,  this  function  should  certainly 
vary  as  the  a  priori  probability  P(A,  D^^)  and  also  vary  as  j  •  Let  us 
assume  that  the  probable  relevance  varies  jointly  as  P(A,  D^^)  and 
We  obtain  then 


P(A.R,D.)  =  a^  P(A,D.).w.^  (j  fixed) ,  (2) 

where,  because  we  have  one  such  equation  for  each  j,  we  indicate  that 
the  constant  of  variation  can  itself  be  a  function  of  j.  Thus,  given 
1.,  we  rank  the  documents  according  to  the  quantity 
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P(A.D.)-w.j. 

Equation  (2)  can  be  regarded  as  the  fundamental  principle  for  Proba¬ 
bilistic  Indexing.  Subsequent  experiments  are  to  be  thought  of  as  em¬ 
pirical  testing  of  this  principle. 

1 . 6  The  Meaning  of  the  Weights 

The  assumption  (2)  allows  us  to  give  a  simple  interpretation  to  the 
weights.  From  an  inverse  probability  calculation  we  have 


P(A,  D  )-P(A.D  I  ) 

P(A.l  .D  )  =  - i J- 

^  *  P(A,  I  ) 


Comparing  (2)  and  (3)  we  see  that 


- -  .P(A.  D  I  )  . 

P(A.P)  i  J 


Now  the  coefficient  can  be  determined  from  (2).  For,  if  we  sum 
both  sides  of  (2)  over  the  subscript  i  and  note  that 


2p(a.i.,d,)  =  1 

i  ^ 


by  the  axiom  of  library  completeness  with  respect  to  the  index  terms, 
then  we  obtain 


^  =  2  P(A,D.).w 

a.  i 
J 


Thus,  we  have  the  result 


2  P(A,D.)  w 


P(A,  F) 


P(A.  D.  ,  I.). 
1*  j' 
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We  call  the  coefficient  of  P(A.D.,1.)  "p."  so  that 

i’  j' 


w..  =  p.  •  P(A.D.,I.), 
ij  1  J 


where 


Sp(A,D.)- 
P  i - 

■»  P(A.  P) 


1 . 7  The  Modified  Weight 

Our  theory  shows  immediately  the  possibility  of  correcting,  in  a  cer¬ 
tain  sense,  the  values  w.^  given  by  the  indexer.  Let  us  modify  the 
weights  by  using  the  factor  p^  defined  above;  thus: 

=  w../p.  =  P(A.D.,I.).  (9) 

ij  ij  'j  '  i’  j' 

Now  it  is  true  that  such  a  modification  has  no  effect  on  the  document 
ranking  for  a  single  request  1^,  but  the  possibility  of  modification  al¬ 
lows  us  to  justify  our  computational  procedure  for  Boolean  functions 
of  the  index  terms  as  well  as  to  isolate  certain  problems  in  the  proc¬ 
essing  of  the  library  statistics.  In  a  sense,  modifying  the  weights  ac¬ 
cording  to  (9)  is  a  smoothing  operation,  for  an  inspection  of  the  num¬ 
erator  of  Pj  shows  it  to  be  the  weighted  mean  of  the  Wj^^  in  the  col¬ 
umn  of  the  probabilistic  matrix  (with  weights  given  by  P(A,D^)).  Such 
a  smoothing  is  necessary  in  making  weights  assigned  to  different  index 
terms  comparable. 


1 . 8  Further  Remarks  on  the  Meaning  of  the  Weight 

Formula  (8)  shows  the  weight,  which  we  originally  interpreted  as  the 
degree  to  which  the  index  term  applied  to  the  document,  to  be  intimately 
related  to  the  probability  P(A.Dj,,I.).  This  is  a  logical  consequence  of 
assumption  (2).  The  statistical  meaning  of  this  probability  can  be 
clarified  as  follows:  Suppose  we  presented  to  each  member  of  a  sampling 
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of  potential  library  users  the  document  and  asked  of  them  if  they 
would  have  used  the  term  k  in  requesting  it.  The  resulting  relative 
frequency  in  the  sample  would  be  an  estimate  of  P(A.  D.,  P).  Now  for¬ 
mula  (8)  relates  the  semantic  measure  "degree  to  which  the  index  term 
applies  to  the  document"  and  the  statistical  measure  of  how  the  terms 
will  be  used  in  retrieval  requests;  viz.  ,  P(A.  T).  Since  the  statistics 
required  are  not  available  and  certain  quantities  must  be  estimated, 
formula  (9)  tells  us  that  the  indexer  would  do  better  by  estimating  P(A. 
thus  bringing  the  coefficient  Pj  as  close  to  unity  as  possible. 


One  might  raise  the  following  question  at  this  point:  If  the  indexer  were 

required  to  estimate  P(A.  D.,Ij),  why  not  estimate  P(A.  directly, 

since  this  is  the  goal  of  the  computations?  Actually,  this  is  not  quite 

correct.  As  we  will  show  in  the  next  section  the  goal  of  the  computations 

is  the  determination  of  P(A.  R,  D.)  where  R  is  any  Boolean  function  of  the 

index  terms.  This  quantity  must  be  expressed  in  terms  of  probabilities 

each  involving  one  and  only  one  index  term.  The  only  way  to  do  this  is 

to  transform  P(A.R,D^)  so  that  R  goes  into  the  attribute  class,  but  then 

the  result  involves  the  probabilities  P{A.  D.,I.).  In  other  words,  we 

^  J 

always  need  the  quantities  P(A.D.,T)  but  hardly  ever  the  quantities 
P(A.  L,D.).  The  second  argument  in  favor  of  the  estimation  of  P(A.  Dj^,T) 
over  P{A.T,D.)  appears  when  we  consider  the  consistency  of  the  compar¬ 
ative  values.  The  indexer  looks  at  each  document,  then  runs  through  the 
various  possible  index  terms  which  apply.  In  general  P(A.  will 

vary  over  a  much  larger  range  than  P(A.P,D.)  as  j  varies  and  therefore 
it  is  easier  psychologically  for  the  indexer  to  correctly  rank  the  values 
over  the  larger  range.  Furthermore,  errors  in  the  weighting  (due  to 
the  estimation  of  P(A.  Dj^,T))will  have  a  smaller  effect  on  the  final  compu¬ 
tation  than  if  P(A.F,Dj^)  is  estimated  initially  and  P(A.  D^,  I^)  computed 
subsequently. 
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1 . 9  Requests  as  Boolean  Functions 

Before  looking  at  the  computational  procedure  for  deriving  the  rele¬ 
vance  number  given  any  arbitrary  request  R,  we  must  explain  the 
meaning  of  the  language  of  the  request.  We  allow  two  logical  opera¬ 
tions  between  index  terms;  viz.  ,  "and”  and  "or".  We  abbreviate  "Ij^ 
or  I^"  by  V  ^2"’  "^1  ^2”  first  is  called  a  disjunc¬ 

tive  request,  the  second,  a  conjunctive  request.  We  now  ask:  If  "Ij" 
and  "I2"  3-re  names  of  subjects,  can  "Ij^.12"  and  "1^^  v  I2''  also  be  names 
of  subjects?  As  a  matter  of  fact  it  is  convenient  to  answer  this  in  the 
affirmative.  The  different  interpretations  of  the  logical  combinations 
^  *2’  used  in  request  formulations  are  shown  in  Table  1. 


Request: 

I1.I2 

^^^2 

Logical  Meaning 

User  requests  informa- 

User  requests  infor- 

tion  on  the  "subject" 

mation  on  the  "subject" 

designated  by 

designated  by  1^  v  I2 

Retrieval  Instruction 

Search  for  documents 

Search  for  documents 

Meaning 

indexed  under  I.  and 

'2 

indexed  under  I,  and 
search  for  documents 
indexed  under  I2 

Class  Meaning 

User  obtains  documents 

User  obtains  documents 

indexed  under  both  1. 
and  I, 

indexed  under  I.  or  I, 

or  both 

Table  1.  Interpretation  of  Logical  Connectives 


Note  how  the  "v"  inside  a  retrieval  prescription  becomes  an  "and"  in 
the  retrieval  instructions.  We  can  say  that  a  disjunctive  request  is 
actually  several  requests  but  the  searches  are  to  be  conducted  simul¬ 
taneously.  The  class  meaning  defined  above  has  a  simple  geometric 
interpretation  (a  Venn  diagram): 
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The  circle  on  the  left  represents  those  documents  indexed  under 
on  the  right  those  indexed  under  The  overlap  gives  all  documents 

obtained  by  requesting  and  the  entire  area  all  those  documents 

obtained  by  requesting  v  I^- 

1.10  The  Extension  of  the  Weight  Function 

By  extending  the  notation  for  a  request  to  include  logical  combinations 
of  tags,  we  can  consider  every  request  R  (i.e.  ,  every  Boolean  function 
of  index  terms),  as  an  event  class.  For  example  R  =L,  R 
R  =  L  V  Ij^,  etc.  By  the  development  in  section  1.6,  we  see  that  if  it  is 
possible  to  compute  P{A.  R)  then  we  can  rank  documents  according 
to  probable  relevance  by  taking  the  relevance  number  to  be 

P(A,  D.)-P(A.  D.,R); 

for,  by  the  inverse  probability  calculation 

P(A.R,D  )  -( - i -  ]  .  P(A,  D  ).P(A.D  R),  (10) 

^P{A,R)  J 

so  that  P(A.  R,  Dj^)  is  proportional  to  P(A,  P(A.  R).  Now  we  note 
that  by  (9),  P(A.  D^^.R)  is  an  extension  of  the  modified  weight  function 
in  the  sense  that; 

If 
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then. 


(11) 


w..  =  w.(R)  =  P(A.D.,R). 

Thus  the  problem  is  to  extend  the  function  whose  values  are  given 

only  for  1^,  .  .  .  ,  I^,  to  any  Boolean  function  of  these  terms.  We  denote 
this  extension  by  "<j^(R)"  and  we  require  this  extension  to  satisfy  the  rules 
of  probability  since  we  intend  for  it  to  be  an  estimate  of  P(A.  D^,  R) .  In 
particular,  we  require: 

0^W.{R);^1,  (12) 

u,.(l^  V  I^)  +  «-(I^.  I2)  =  «i(Ii)  +  ‘-i(l2)-  (14) 

We  note  the  important  fact  that  (14)  allows  us  to  compute  the  weight  of 
a  disjunction  if  the  weight  of  a  conjunction  is  known.  Successive  appli¬ 
cations  of  (14),  combined  with  logical  transformations,  allow  the  weight 
of  any  request  to  be  written  as  additions  and  subtractions  of  weights  of 
single  terms  or  conjunctions.  Thus  the  problem  of  the  extension  of  the 
weight  function  is  reduced  to  the  extension  to  conjunctions.  For  these 
weights  we  also  have  certain  restrictive  conditions.  If  we  let  p  =  (0^(1^) 
and  q  =  ^^^(I^),  then  it  can  be  shown  that  (o.  (I^.  I^)  must  be  less  than  or 
equal  to  the  minimum  of  the  two  numbers  p  and  q  and  must  be  greater 
than  or  equal  to  p  +  q  -  1  if  this  is  positive,  otherwise  it  must  be  greater 
or  equal  to  0.  We  write  this  condition  as: 

max  1^0,  p  +  q  -  ij  w^(I ^  .  I^).^  min  T  p,  q~|  .  (15) 

We  have  decided  to  take  as  the  initial  w-value  of  a  conjunction  its  inde¬ 
pendence  value;  i.  e.  , 


o,i(Ii 


■  1?)  = 


il  i2' 


(16) 
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The  relevance  number  for  a  conjunction  1^.  is  then  given  by 

P(A,  “i2’ 

and  the  relevance  number  for  a  disjunction  v  becomes  by  (14) 

P(A.D.).  [«ii  +‘-i2  -  • 

Several  remarks  need  to  be  made  about  use  of  the  independence  value. 
Note  that  we  do  not  say  that  the  tags  are  independent--in  fact  they  are 
not--but  the  word  'estimate'  is  useful  to  avoid  making  a  false  assump¬ 
tion.  First,  we  estimate  01.(1. .  I,)  by  to  •  CO  .  Second,  weusethein- 
dependence  value  relative  to  the  class  D^,  that  is,  we  take 

P(A.  D..Ij,  I^)  =  P(A.  D-.I^),  (17) 

but  not 

P(A.Ij,l2)  =  P(A,  I^).  (18) 

We  believe  the  former  estimate  is  more  accurate  than  the  latter.  In 
Part  II,  2.  6  we  discuss  a  coefficient  of  association  between  index  terms. 
This  coefficient  which  we  call  Q  lies  in  the  interval  j^j"  ^  with  Q  =  0 
being  the  point  of  independence.  The  joint  occurrence  of  two  events 
will  have  a  probability  in  excess  of  its  independence  value  only  if  the 
corresponding  value  of  Q  is  positive.  We  have  two  intervals  to  sche¬ 
matize  this  situation:  (p  and  q  are  the  probabilities  of  the  separate 
events  and  Q  their  coefficient  of  association) 

75  Q  -  >  I 

Interval  of  Positive  Association 


p-q'^  Probability- —  >  ^min|^p,q 

Corresponding  Interval  of  Probability  Value 
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An  investigation  of  the  statistical  correlation  between  tags  via  the  com¬ 
putation  of  Q  and  then  a  subsequent  study  of  which  pairs  of  tags  were 
used  in  requesting  shows  that  Q  had  positive  values  for  almost  all  of 
these  pairs.  This  indicated  that  computations  were  called  for  with 
estimates  of  I^)  taken  at  the  upper  end  of  the  scale;  i.  e.  ,  where 

o^idj.I^)  =  min  jjOi^,a>i2^.  (19) 

The  results  were  not  as  successful  as  using  the  independence  value. 

A  possible  explanation  lies  in  noting  that  independence  is  a  three  term 
relation  as  formulas  (17)  and  (18)  show.  It  could  well  be  that  the  prob¬ 
ability  value  for  tags  and  relative  to  the  reference  class  A  lies 
closer  to  the  maximum  value  (min  ^p,  q^  );  while  the  probability  value 
for  and  relative  to  A.  lies  closer  to  its  independence  value. 

For  our  initial  estimates  we  assume  this  to  be  the  case. 

1.11  Estimation  and  Correction 

We  have  given  a  formal  clarification  of  the  notions  behind  Probabilistic 
Indexing.  We  see  that  the  computation  of  a  relevance  number  requires 
for  the  single  request  the  quantities  P(A,D.),  for  Boolean  func¬ 

tions  the  quantities  P(A,D^),  w. .,  and  the  w-values  for  conjunctions. 

The  next  problem  is  to  obtain  these  quantities  and  this,  in  turn,  involves 
two  problems;  viz.  , 

(a)  the  initial  estimation;  i.  e.  ,  the  estimates  before  library 
statistics  are  obtained; 

(b)  the  correction  of  the  initial  estimates  as  library  statistics 
are  accumulated  and  subsequent  periodic  revisions  via  a  feed¬ 
back  computation. 

These  two  problems  are  called  the  "problem  of  estimation"  and  the  "prob¬ 
lem  of  correction".  Note  that  a  solution  of  the  second  problem  must  be 
qualified  by  two  requirements:  (1)  the  effects  of  the  initial  estimates  must 
die  out  as  library  statistics  are  accumulated,  (2)  the  solution  must  not  in¬ 
volve  an  impractical  amount  of  computation. 
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1.  12  The  Problem  of  Estimation 


Consider  first  the  estimation  of  P(A,  D^)-  For  real  libraries  where 
no  statistics  are  available  we  are  confronted  with  this  problem  (as  for 
example,  in  the  cases  when  the  library  has  not  yet  been  used  or  when 
new  documents  are  added).  One  possibility  for  setting  the  library 
system  in  motion  is  to  take  all  initial  values  equal;  i.  e.  , 

P(A.  Dj)  =  P(A,  D^)  -  .  .  .  . 

Alternately  we  can  construct  a  more  realistic  distribution  ("more  re¬ 
alistic"  because  our  method  leads  to  a  distribution  which  corresponds 
more  nearly  to  the  actual  distribution  for  some  large  libraries;  viz.  , 
a  non-linear  (hyperbolic)  distribution)  The  considerations  used  in 
such  a  simulation  are.  1)  that  a  correlation  exists  between  the  prob¬ 
ability  that  a  given  tag  will  be  used  in  indexing  and  the  probability  that 
it  will  be  used  in  requesting;  2)  if  document  has  a  higher  a  priori 
probability  than  document  then  probably  has  tags  that  does 
not  have  and  that  are  used  more  frequently  in  requesting.  Thus 
P(A,  Dj^)  should  depend  on  the  extent  that  covers  the  subjects  desig¬ 
nated  by  the  library  tags  and  also  on  the  scope  of  its  individual  tags; 
i.  e.  ,  on  the  relative  frequency  with  which  the  tags  are  used  in  indexing. 
We  therefore  take  as  initial  value  of  P(A,  D^): 

a  N.-  w.., 

th 

where  N.  is  the  number  of  documents  to  which  the  j  index  term  is 

J  .th 

applied  with  non-zero  weight;  w. .  is  the  weight  with  which  the  j  in- 

th 

dex  term  applies  to  the  i  document,  and  a  is  the  normalization  fac¬ 
tor  (i.  e.  ,  the  value  that  gives 

P(A,D  )  =  1.  ) 

i  =  1 
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By  an  argument  similar  to  the  above  we  find  that 

Sp(A,  D  )•  w 

■i  J 


is  a  plausible  estimate  of  P(A,  L).  This  estimate  also  has  the  virtue 
of  forcing  the  initial  value  of  the  factor  in  (8),  section  1. 6, to  equal 
unity.  Thus,  initially, 


The  only  problem  that  remains  is  the  estimate  of  the  co-values  for  con¬ 
junctions.  We  have  decided  to  choose  the  value 

forw.  (I,.I-),  for  reasons  described  in  section  1.  10. 

1  1  2' 

1.13  The  Problem  of  Correction 

Consider  now  the  correction  of  P(A,  D.).  Let  P  be  its  initial  estimate. 
(The  subscript  "i"  will  be  fixed  throughout  the  discussion.)  After  n 
uses  of  the  library,  let  n  be  the  number  of  times  that  the  i'^^  document 
has  been  used.  The  empirical  estimate  of  P(A,  D^)  is  therefore  n/n. 

We  want  to  combine  this  with  in  some  way.  Let  us  do  so  by  the 
following  device:  Annex  to  the  sequence  of  events  of  class  A(i.e.  ,  A^^, 
A^,  .  .  .  ,  A^)  a  fictitious  initial  sequence  of  length  n^  and  suppose 
that  this  initial  sequence  has  given  the  relative  frequency  P^,  then  the 
total  sequence  of  length  n^  +  n  will  give  the  relative  frequency 

n  +  n  •  P 

T-»_  OO 


Thus  by  a  suitable  choice  of  n  we  can  control  the  effect  of  P  on  the 

th  °  ° 

n  estimate,  P  .  For  example,  if  n  =0  then  P  has  no  effect  on  the 
n  _  o  o 

computation  (P  =  n/n);  while  if  n  -  co  then  P  has  its  maximum  effect 
^  '  n  o  o 

(P  =  P  ). 
no 
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Formula  (21)  gives  a  satisfactory  estimate  of  P(A,  D^)  in  a  stationary 
system;  i.  e.  ,  when  the  reference  class  A  does  not  vary  with  time  so 
that 


P(A.D.)  .limP^  (22) 

n— >  oo 

holds.  But  in  reality  this  will  not  be  the  case  since  the  popularity  of 
documents  will  vary  with  time  and  therefore  we  must  look  for  a  pro¬ 
cedure  in  which  the  most  recent  statistics  have  the  most  important 
influence  on  an  estimation  of  P(A,D^).  We  present  a  method  that  takes 
this  consideration  into  account  as  well  as  being  suitable  for  machine 
computation. 


First,  it  will  be  convenient  to 

(a)  compute  P^  periodically;  i.e.,  after  sequential  blocks  of 
fixed  size,  say  m; 

(b)  store  only  statistics  on  the  block  presently  occurring  and  the 
previous  estimate  of  P(A,  D^^). 


We  propose  the  following  computing  schema:  let  nn  be  the  number  of 

th  th  ^ 

limes  that  the  i  document  was  used  in  the  k  block  of  length  m,  then 

take  as  the  first  estimate  of  P(A,  D.) 


P 


(1) 


m,  +  n  ■ 
1  o 


m  -(-  n 

o 


P 

o 


(23) 


th 

and  as  the  k  estimate 


,(k) 


m,  +  n  *  P 
k  o 


(k-l) 


m  +  n 


(24) 


(k) 

Let  us  see  how  the  "block”  relative  frequencies  are  involved  in  P'  '. 

Let  "A"  stand  for  "n  /  (m  +  n  )".  Then  it  can  be  shown  that 

o  '  o' 
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Thus  P'  'is,  what  is  called,  a  convex  linear  combination  of  the  rela¬ 
tive  frequencies 


mj^  j/m,  .  .  .  ,  m^/m, 


with  weights 


Tliat  the  linear  combination  is  convex,  i.e.  ,  that  the  sum  of  the  weights 
is  one,  is  seen  from  the  fact  that 


(26) 


We  see  that  the  sequence  of  weights  diminishes  so  that  the  more  recent 

(k) 

relative  frequencies  more  strongly  influence  the  value  of  P'  We 
also  see  that 


lim  =  O 
h->oo 


(27) 


so  that  the  effects  of  the  earlier  statistics  die  out  as  k  increases.  As 
an  example  we  note  the  special  case  where  =  m;  then  A  =  1/2  and 
we  have  the  weights: 

1/2,  1/4,  .  .  .  ,  1/2*^,  1/2^. 

As  a  final  remark  on  the  computation  of  P{A,D^)  we  emphasize  that 
caution  is  required  in  the  use  of  our  computing  schema  as  given  by 
(23)  and  (24).  The  possibility  exists  that  the  instances  of  use  of  the 


library  may  attain  a  large  number  (relative  to  the  number  of  documents) 
during  a  period  of  time  when  conditions  are  sufficiently  stationary  to 
enable  us  to  say  that  (22)  holds.  In  this  case  it  can  be  shown  that  the 
recursive  procedure  is  assured  of  working  only  if  we  take  m  so  large 
that  each  value  P'  '  closely  approximates  P(A,Dj^).  But  in  that  case 
we  would  not  like  to  be  committed  to  P^  while  the  first  sequential  block 
is  occurring.  Thus  it  is  suggested  that,  for  libraries  of  this  type,  we 
revert  to  some  form  of  (21)  with  computations  performed  at  shorter 
intervals . 

The  correction  of  P( A,  L)  follows  lines  similar  to  the  foregoing  dis¬ 
cussion  of  P(A,D^).  There  is,  however,  a  rather  subtle  question  in¬ 
volved  in  the  processing  of  the  relative  frequency  data.  That  is  to 
say,  all  the  probabilities  are  determined  if  we  know  the  probabilities 
of  conjunctions;  but  many  requests  will  be  given  as  disjunctions  and 
thus  if  a  library  user  requests  information  on  the  subject  designated 
by  V  then  this  should  affect  the  relative  frequency  of  the  requests 

and  the  requests  I^.  A  counting  procedure  for  distinctive  requests 
must  therefore  be  established.  The  best  possibility  seems  to  be  to 
give  partial  "credit"  to  each  disjunct  in  an  instance  of  a  disjunction. 

The  best  way  to  do  this  is  still  open. 

The  next  item  is  the  modification  of  the  w-values.  In  principle,  if  per¬ 
fect  accuracy  were  required,  we  would  need  the  determination  of 
P(A,  R)  where  R  is  any  conjunctive  request;  for,  if  u)^^^^(R)  is  the  k^^ 
estimate  of  Wj^(R)  and  P^^^(A,  R)  is  the  k*^^  estimate  of  P(A,  R)  then 

^_(k+l)^R)  -  P^^^A.  R)y/^.  P^^^A,  D.)-  ui.^^^(R)  (28) 

The  w-values  for  any  request  would  then  be  obtained  by  the  extension 
of  formula  (14)  mentioned  in  section  1.  10.  However,  because  even 
for  a  small  number  of  index  terms,  the  number  of  possible  conjunc¬ 
tions  is  enormous  and  therefore,  practical  considerations  would  prob¬ 
ably  limit  the  application  of  formula  (28).  Suppose  therefore,  we 
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settle  for  accuracy  for  only  the  co-values  for  single  terms.  This  should 
be  sufficient  for  the  following  reasons:  (1)  if  we  use  some  reasonable 
computational  procedure- -perhaps  even  incorporating  coefficient  of 
association  data--we  should  obtain  a  sufficiently  accurate  relevance 
number  to  both  order  the  documents  and  to  execute  the  search  strategy 
program  of  Part  II,  section  2,  (2)  the  co-value  and  the  number  of  docu¬ 
ments  retrieved  rapidly  diminishes  as  the  number  of  conjunctive  terms 
increases- -in  either  case  accuracy  in  the  relevance  number  is  not 
required. 

1 .  14  Weighted  Requests 

The  request  language  is  still  rather  limited  even  though  we  allow  all 
combinations  of  index  terms  by  means  of  the  connectives  "and”  and 
"or".  However,  when  we  consider  the  retrieval  instruction  meaning 
of  the  request  1^^  v  (Table  1,  p.  2  1 )  an  obvious  extension  of  the  re¬ 
quest  language  presents  itself.  That  is  to  say,  when  the  requestor 
asks  that  two  simultaneous  searches  be  made,  one  under  1^  and  one 
under  I^,  let  him  now  indicate  which  search  he  regards  as  more  im¬ 
portant.  To  incorporate  such  information  into  a  computational  proce¬ 
dure  we  allow  him  to  give  this  comparative  data  in  the  form  of  numeri¬ 
cal  "request  weights".  We  will  use  the  expression: 

(a)I^  V 

where  a  and  p  are  the  request  weights,  to  represent  this  new  type  of 
request.  More  generally,  we  note  that  conjunctions  will  occur  in  place 
of  I  and  1^  in  this  expression,  each  conjunction  prescribing  a  search 
and  having  an  assigned  weight.  We  can  conceive  of  the  w'eights,  then, 
as  indicating  the  degree  to  which  the  conjunction  describes  the  infor¬ 
mation  requirement  of  the  requestor.  This  is  suggested  when  we  go 
from  the  retrieval  instruction  meaning  of  a  request  to  the  logical  mean¬ 
ing  (Table  1,  p.21).  The  highest  level  of  specificity  that  the  requestor 
can  attain  is  by  means  of  conjunctions.  The  conjunctions  are  (artificial) 
names  of  sub-subjects  and  since  the  requestor  is  uncertain  about  his 
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use  of  tags  in  forming  these  names,  he  will  try  to  avoid  possible  loss 
of  relevant  information  by  using  disjunctions.  Thus  when  we  permit 
him  to  weight  each  of  these  names;  i.  e.  ,  conjunctions,  we  can  treat 
the  weight  as  being  an  indication  of  either  the  degree  of  the  requestor's 
interest  in  the  subject  designated  by  the  conjunctive  set  of  terms  or 
how  closely  it  matches  his  information  requirements.  For  example, 
if  the  conjunctions  reduce  to  single  terms,  then  the  expression 

(.7)Ij  V  (.  3)12 

means  the  requestor  is  interested  in  I  ^  to  the  degree  0. 7  and  in  to 
the  degree  0.3. 

Given  this  interpretation  for  the  notion  of  request  weights  we  must  now 
provide  a  set  of  rules  for  computing  the  relevance  number  of  weighted 
requests.  This  means  that  we  need  to  evaluate 

c..[(a)I^v(P)l2^. 

Two  methods  appear  to  be  reasonable: 

(1)  w.  |la)I^v(p)l2^=  (a)w.(I^)+(p)w.(l2) -(ap)w(l^  .  I2) 

(2)  OJ.  j^(Q)l  v(p)l2^=  (a)Wj^(I^) +(P)a)^(l2)  -  rnin 

Method  (1)  has  the  advantage  of  computational  simplicity  as  well  as  a 
certain  appeal  in  being  a  direct  modification  of  the  weights  in  the  prob¬ 
abilistic  matrix  through  multiplication  of  the  request  weight  with  the 
corresponding  tag  weight  (recall  that  initially  we  take  w^(Ij^.l2)  equal  to 
to.  (I,  )•  CO.  (I,)).  Method  (2)  has  the  virtue  of  giving: 

“i  V  I2). 

thus  reducing  to  the  case  of  the  unweighted  request  when  a  =  p  (up  to 
multiplication  by  a  constant).  In  our  work  on  the  weighted  request  we 
have  used  formula  (1). 
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To  complete  the  discussion  we  look  at  further  possibilities  of  general¬ 
ization.  The  most  obvious  extension  is  the  assignment  of  request 
weights,  not  just  to  conjunctions,  but  to  each  index  term.  We  could 
explicate  the  meaning  of 


as  a  "quantitative  assertion"  of  (analogous  to  the  "quantitative  ne¬ 
gation"  in  Probability  Logic).  The  number  a  would  be  the  degree  of 
assertion  given  to  the  tag  I^.  In  this  interpretation  we  have: 

(0)  =  not  I^, 

(1)1,  =1,. 

The  computational  procedure  which  would  be  best  to  use  here  is  still 
open. 

Another  possibility  is  a  statistical  or  probabilistic  explication  of  re¬ 
quest  weights;  i.e.  ,  an  explication  of  the  type  as  given  for  There, 

the  logic  was  to  go  from  a  comparative  notion  to  a  quantitative  notion, 
then  to  explicate  the  latter.  We  have  a  similar  explicative  problem 
with  the  request  weights  but  no  simple  solution  presents  itself. 
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2.  THE  AUTOMATIC  ELABORATION  OF  THE  SELECTION  PROCESS 


2.  1  Initial  Remarks 

The  technique  of  Probabilistic  Indexing,  as  we  have  seen,  allows  a 
computing  machine,  given  a  request  for  information,  to  make  a  statis¬ 
tical  inference  and  derive  a  relevance  number  for  each  document.  The 
result  of  a  search  is  an  ordered  list  of  those  documents  which  satisfy 
the  request,  ranked  according  to  their  probable  relevance.  We  would 
prefer  to  have  a  technique  which  not  only  decides  of  a  given  class  of 
documents,  which  among  them  is  most  probably  relevant,  next  most 
probably  relevant,  etc.  ,  but  which  also  decides  whether  the  class  it¬ 
self  of  retrieved  documents  is  adequate  (at  least  in  the  sense  of  deter¬ 
mining  whether  or  not  it  excludes  some  documents  which  are  relevant 
to  the  user's  information  needs).  That  is  to  say,  if  we  consider  the 
request  as  a  clue  which  the  user  gives  to  the  library  to  indicate  the 
nature  of  his  information  needs,  then  we  should  raise  the  following 
question:  Given  a  clue,  how  may  it  be  used  by  the  library  system  to 
generate  a  best  class  of  documents  (to  be  ranked  subsequently  by  their 
relevance  numbers)?  Thus  given  the  clue,  how  can  we  elaborate  upon 
it  automatically  in  order  to  produce  a  best  class  of  retrieved  documents? 
Let  us  turn  our  attention  to  this  problem. 

2 .  2  Search  Strategies  and  the  Notion  of  Distance 

A  library  request  (a  clue)  is  a  Boolean  function  whose  variables  are 
index  terms,  which,  in  turn,  selects  a  class  of  documents  via  a  logical 
match.  That  is  to  say,  all  of  those  documents  whose  index  terms  are 
logically  compatible  with  the  logic  and  the  tags  of  a  request  R  consti¬ 
tute  the  class  of  retrieved  documents  C.  Our  goal  is  to  extend  the 
class  C  in  the  most  probable  "direction"  and  this  can  be  done  in  two 
ways.  One  method  involves  the  transforming  of  R  into  R'  where  R'  in 
turn  will  select  a  class  of  documents  C’,  which  is  larger  than  C  and 
contains  more  relevant  documents.  A  second  method  does  not  modify 
R  but,  rather,  it  uses  the  class  C  to  define  a  new  class  C"  .  A  set  of 
rules  which  prescribe  how  to  go  from  a  given  request  R  to  a  class  of 
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retrieved  documents  is  called  a  strategy.  ^  A  strategy,  in  turn,  in¬ 
volves  the  use  of  several  different  techniques  for  measuring  the  "dis¬ 
tances"^  between  index  terms  and  between  documents.  Before  pro¬ 
ceeding,  let  us  introduce  some  further  notations  to  make  more  pre¬ 
cise  what  we  have  been  saying. 

We  understand  by  "basic  selection  process"  the  rule  which  uses  the 
request  to  select  the  class  of  documents  whose  tags  are  logically  com¬ 
patible  with  the  logic  and  tags  of  the  request,  and  we  denote  this  basic 
selection  process  by  the  functional  notation  "f".  Thus  f  is  the  transfer 
function  from  inputs  (requests)  to  output  (class  of  retrieved  documents) 
and  we  write 

f(R)  -  C  (1) 

where,  again,  R  is  the  request  and  C  is  the  class  of  retrieved  docu¬ 
ments.  The  problem  is  to  enlarge  C  so  as  to  increase  the  probability 
that  it  will  contain  relevant  documents  and  to  decrease  the  probability 
that  it  will  contain  irrelevant  documents.  This  can  be  done  in  the  fol¬ 
lowing  way:  Suppose  R'  is  a  request  similar  in  meaning  to  R,  then  we 
can  take  as  a  possible  modification  of  f,  say  f, 

f'(R)  =  f(R)  V  f(R-)  r:  C  V  C-.  (2) 

(As  before  "v"  designates  class  union. ) 


We  mean  here  search  strategy  from  the  viewpoint  of  the  library  com¬ 
puter.  The  requestor  also  has  a  search  strategy  which  is  given  by  the 
relevance  number  of  the  documents  that  he  is  given. 

^We  use  "distance"  in  an  informal  sense;  i.  e.  ,  it  may  not  satisfy  all 
the  axioms  of  distance  (e.g.  ,  the  triangle  inequality  and  symmetry). 
The  reason  for  this  is  that  it  is  frequently  necessary  to  preserve  nat¬ 
ural  logical  structure  and  forego  artificial  metric  structure.  Indeed, 
in  one  case  we  violate  positiveness  of  distance  functions. 
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This  modification  can  be  made  precise  if  we  are  able  to  invent  a  metric 
or  "distance"  function  on  the  request  space  to  measure  dissimilarity- 
in  meaning.  Since  we  are  not  sure  what  meaning  is,  much  less  being 
able  to  assign  a  numerical  quantity  to  it,  this  is  rather  difficult;  but 
we  shall  show  later  that  statistics  can  provide  such  measures.  For 
the  present,  suppose  we  actually  do  have  such  a  metric,  then  we  can 
generate  a  modified  selection  function  f  by  defining  f'(R)  1°  be  the 
union  of  all  classes  f(R')  where  the  distance  between  R  and  R'  is  less 
than  some  specified  number,  say  e.  Symbolically  this  written 

f(R)  =  U  f{R'),  (3) 

[d(R,R')<6] 

Analogously,  if  we  have  a  "distance"  function  in  the  document  space 
which  gives  "nearness"  as  a  numerical  measure  of  similarity  of  infor¬ 
mation  content,  then  a  completely  different  modification  f"  of  f  arises 
via 

f"(R)  =  C"  (4) 

where  C"  consists  of  all  documents  whose  distance  from  C  =  f(R)  is 
less  than  e.  (We  remark  in  passing  that  "distance"  notions  seem  to 
present  a  surprisingly  fruitful  approach  to  the  library  problem;  e.g.  , 
the  relevance  number  itself  can  be  thought  of  as  giver^the  nearness 
between  documents  and  requests.) 

Thus,  we  see  that  a  machine  strategy  can  elaborate  upon  the  basic  se¬ 
lection  process  in  order  to  improve  the  search  in  one  of  two  different 
ways.  The  first  is  to  establish  a  metric  for  distance  in  request  space 
so  as  to  formulate  R',  given  R.  The  other  way  is  to  use  the  class  of 
documents  C,  obtained  by  the  initial  request  R,  to  define  a  new  class 
C"  .  Both  of  these  methods  are  discussed  below. 

2.  3  The  Notion  of  Index  Space 

Geometrically  speaking,  one  may  think  of  the  set  of  n  index  terms 
which  constitute  the  library  catalogue  "vocabulary"  as  points  in  an 
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n-dimensional  space.  The  points  in  this  space  are  not  located  at  ran¬ 
dom,  but  rather,  they  have  definite  relationships  with  respect  to  one 
another  depending  on  the  meanings  of  the  terms.  For  example,  the 
term  "logic"  would  be  much  closer  to  "mathematics"  than  to  "music". 
One  always  finds  when  looking  up  index  terms  in  the  catalogue  of  a 
conventional  library,  other  terms  listed  under  "see"  and  "see  also". 
This  cross -indexing  ("see/see  also")  aspect  of  a  library  indicates 
some  of  the  relationships  that  index  terms  have  for  one  another;  i.  e.  , 
it  indicates  some  of  the  relationships  between  points  in  index  space. 

The  "distances"  between  index  terms  can  be  made  explicit  by  formu¬ 
lating  probabilistic  weighting  factors  between  them.  Once  numerical 
weighting  factors  are  coordinated  with  the  distances  the  c ros s -indexing 
aspect  of  a  library  can  be  mechanized  so  that  given  a  request  involving 
one  (or  many)  index  terms,  a  machine  could  compute  other  terms  for 
which  searches  should  be  made.  That  is  to  say,  a  request  places  one 
at  a  point,  or  several  points,  in  index  space  and  once  the  distances  be¬ 
tween  points  are  arithmetized,  a  machine  could  determine  which  other 
points  to  go  to  in  order  to  improve  the  request.  Thus,  the  elaboration 
of  a  request  on  the  basis  of  a  probabilistic  "association  of  ideas"  could 
be  executed  automatically. 

2 .  4  Automatically  Groping  in  Index  Space 

There  are  at  least  two  different  kinds  of  relationships  that  can  exist 
between  the  points  in  index  space;  viz.  ,  semantical  relationships  and 
statistical  relationships.  The  most  elementary  semantical  relation¬ 
ship  is  that  of  synonymity,  but  in  addition  to  synonymity  there  are  other 
semantical  relationships  such  as  "partially  implied  by"  and  "partially 
implies".  Such  relationships  between  terms  are  based  strictly  on  the 
meanings  of  the  terms  in  question- -hence ,  the  word  "semantical". 
Another  class  of  relationships  are  statistical;  i.e.  ,  those  based  on  the 
relative  frequency  of  occurrence  of  terms  used  as  indexes.  The  dis¬ 
tinction  between  semantical  and  statistical  relationships  may  be  clari¬ 
fied  as  follows:  Whereas  the  semantical  relationships  are  based  solely 
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on  the  meanings  of  the  terms  and  hence  independent  of  the  "facts”  de¬ 
scribed  by  those  words,  the  statistical  relationships  between  terms 
are  based  solely  on  the  relative  frequency  with  which  they  appear  and 
hence  are  based  on  the  nature  of  the  facts  described  by  the  documents. 
Thus,  although  there  is  nothing  about  the  meaning  of  the  term  "logic" 
which  implies  "switching  theory",  the  nature  of  the  facts  (viz .  ,  that 
truth  -  functional  logic  is  widely  used  for  the  analysis  and  synthesis  of 
switching  circuits)  "causes"  a  statistical  relationship.  (Another  ex¬ 
ample  might  concern  the  terms  "information  theory"  and  "Shannon"-- 
assuming,  of  course,  that  proper  names  are  used  as  index  terms.  ) 

Once  the  various  "connections"  between  the  points  of  index  space  have 
been  establishec  rules  must  be  formulated  which  describe  how  one 
should  rricjve  in  the  luaze  of  connected  points.  We  call  such  rules 
"heuristics"  They  are  general  guides  for  groping  in  the  "maze"  in 
the  attempt  to  create  an  optimal  output  list  of  oocuments  for  any  arbi¬ 
trary  request.  The  heuristics  would  enable  a  machine  to  decide,  for 
a  given  set  of  request  terms,  which  index  terms  to  "see"  and  "see 
also",  ana  how  deen  this  search  should  be  and  when  to  stop,  etc. 
Generally  speaking,  the  heuristics  would  decide  which  index  terms  to 
look  at  next,  on  the  basis  of  the  semantical  and  statistical  connections 
between  terms,  and  the  heuristics  would  decide  when  to  stop  looking, 
on  the  basis  of  the  number  of  documents  that  would  be  retrieved  and 
the  relevance  numibers  of  these  documents.  (Remember  that  each  point 
in  index  space  defines  a  class  of  documents;  viz .  .  all  of  those  docu¬ 
ments  which  have  been  assignee  the  index  term  in  question  with  a  non¬ 
zero  weight.  )  Given  this  understanding  of  heuristics,  we  see  that  an 
over-all  sear  ch  st  rategy  i  s  made  up  of  components  some  of  which  are 
heuristics;  i.e.,  the  sequence  of  devices,  rules,  heuristics,  etc., 
which  lead  from  inputs  (requests)  to  outputs  (classes  of  retrieved  docu¬ 
ments)  is  the  strategy. 
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2. 5  Some  Elementary  Heuristics 


In  order  to  clarify  the  notion  of  developing  heuristics  which  would  de¬ 
termine  how  a  ccrr.puter  should  "grope"  in  index  space,  consider  the 
following  exarriple.  Assume  that  we  compute  the  frequency,  N(L), 
with  which  each  term  is  used  to  tag  a  document,  and  also  that  we  com¬ 
pute  the  frequency,  N(I , .  I,  ),  with  which  pairs  of  terms  are  assigned  to 

J 

documents.  We  can  then  coriipute  the  conditional  probability  P(I.,I,  ) 

J 

that  if  a  term  H  is  assigned  to  a  document,  then  Ij^  also  will  be  assigned: 


N(l.) 


(5) 


We  do  this  for  all  pairs  1.  I,  . 

J  I' 


probability  giver.  i.e. 


Assume  now  that  L'  is  the  index  term  which  has  the  highest  conditional 

is  the  index  term  for  which  PfI  .  1.  )  is  a 

J  j  ^ 

maximum.  Then  given  a  request,  R  =  L,  for  all  documents  tagged 

with  I.,  we  form  a  new  request,  R'  -  1.  v  I.',  which  searches  for  all 
J  ^  J  J 

documents  tagged  with  either  or  ly  , 

sider  R'  instead  cf  R 


Thus  the  rule  is  now  to  con- 


This  heuristic  tells  us  which  tags  are.  closest  (in  one  sense)  to  given 
ones,  but  we  still  have  nc  measure  of  the  "closeness"  (hereafter  written 
without  quotes)  and  such  a  measure  is  needed  as  a  part  cf  the  associated 
computation  rule.  That  is  to  say  we.  elaborate  upon  R  and  obtain  R' 
by  searching  for  documents  indexed  under  tags  closely  related  to  those 
in  the  original  request,  but.  clearly,  the  relevance  numbers  that  we 
derive  for  these  "additional"  documents  should  be  weighted  down  some¬ 
what  in  order  to  indicate  that  they  were  obtained  only  from  tags  which 
are  close  to  those  in  the  original  request.  We  measure  the  closeness 
as  follows:  Let  p^  -  riornialize  p^  over  the  set  of  tags  used 

in  the  request  so  that 


(6) 
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Now,  instead  of  using  w.(l.')  (the  weight  assigned  to  I.'  for  the  i^^  doc- 

t  J  _ J 

ument)  in  the  search  computation,  we  replace  it  by  •  Wj^ 

extended  search  that  we  have  just  described  is  an  elementary  form  of 
only  one  of  a  class  of  possible  heuristics  based  on  the  statistical  re¬ 
lationships  between  tags. 


■:*  ol 


m 


(I.  implies  I,  to  the  greatest  degree) 
J  ^ 


A  second  elementary  heuristic  which  looks  even  more  promising  is 
called  the  "inverse  conditional"  search  and  it  measures  closeness  of 
tags  to  Ij  in  terms  of  the  conditional  probability  from  Ij^  to  h  (instead 
of  conversely  as  with  the  heuristic  described  above).  That  is  to  say 
we  compute  that  P(l,  ,  I.)  which  is  maximum  as  1,  varies  and  this  pro- 

K  J  K 

vides  the  tag  which  most  strongly  implies  the  given  tag  1..  Thus,  in¬ 
stead  of  asking  for  that  tag  which  is  most  strongly  implied  (statistically) 
by  an  arbitrary  tag  in  the  request,  we  ask  for  the  tag  which  most 
strongly  implies  (statistically)  the  given  tag.  Using  this  method  to  de¬ 
termine  the  closeness  of  tags  we  establish  a  measure  for  the  closeness 
by  normalizing  the  probability  as  before.  That  is. 
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pci^.V 

- 


and,  again,  the  corresponding  computation  rule  is  now  pyw^^CL), 
where  '1^  is  the  which  makes  Ij)  ^  maximum  for  a  given  1^. 


ol. 

J 


I  O 
m 


(Ij  implied  by  Ij^ 


to  the  greatest  degree) 


2.6  A  More  Sophisticated  Heuristic 

We  have  just  discussed  two  possible  measures  of  closeness;  viz .  ,  the 

conditional  probability  ),  and  the  inverse  condition  probability 

J  ^ 

P(I,  ,1.).  Now  we  consider  a  third  statistical  measure  which  appears 
^  J 

the  most  promising  of  the  three.  This  is  one  of  several  possible  co¬ 
efficients  of  association  between  predicates^.  The  particular  coefficient 


^G.U.  Yule,  "On  Measuring  Association  Between  Attributes",  Journal 
of  the  Royal  Statistical  Society,  Vol  LXXV,  1912,  pp  579-642. 
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•we  have  chosen  arises  in  the  following  wa-y.  Consider  the  tags  1^  and 
and  partition  the  library  by  four  classifications;  viz.  ,  documents 
indexed  under  both  I.  and  I,  ,  those  indexed  under  I.  but  not  I,  ,  those 

J  k’  j  k 

indexed  under  I,  but  not  I.,  and  those  not  indexed  under  either.  Let- 

_  k  J 

ting  'Ij '  denote  the  complement  of  the  class  L,  etc.  ,  these  four  classes 
are  given  by  I..  I,  ,  I--TL.  l"..!,  ,  ri.IT,  respectively.  The  classification 
and  the  number  of  documents  is  shown  most  conveniently  in  a  table: 


x  =  N(I..y 

u^N(i..y 

N(I.) 

V  =  Nir . 

y  =  Nir.r,^) 

N(r.) 

N(y 

r  N(i;) 

n 

We  have  adjoined  to  the  table  the  row  and  column  sums  and  n  (the  to¬ 
tal  number  of  documents). 

Now,  using  the  notation  of  section  2.  5,  we  say  that  L  is  statistically 

independent  of  I  if 
-  k 

p(i.,y  =  p(ij^).  (7) 

This  can  be  shown  to  be  equivalent  to: 

P(P.y  -  P(I.)-P(y;  (8) 

so  that  rewriting  in  terms  of  frequencies  we  have  an  additional 
equivalence: 


I 


We  can  infer  also  that  the  following  are  equivalent: 

(a)  Ij  is  statistically  independent  of 

(b)  is  statistically  independent  of  1^; 

(c)  is  statistically  independent  of 

(d)  is  statistically  independent  of  ly 

(e)  is  statistically  independent  of 


For  any  pair  I.,  I,  (9)  suggests  that  we  look  at  the  excess  of  ) 

J  K  J  K 

over  its  independence  value;  i.e.,  the  quantity 


6(F,y  =  N(I..y  -  N(F)-N(Ij^)/n.  (10) 

It  can  be  shown  that  this  function  6  has  the  property 

6(1.,  (11) 

and  thus  6  is  associated  with  the  differences  over  independence  values 
in  all  four  classifications. 


Having  discussed  independence  let  us  now  consider  what  properties 

would  be  suitable  for  a  coefficient  of  association  between  I.  and  I,  . 

J  k 

We  call  this  coefficient  ''Q(I.,I,)".  (1)  Q(1.,I,)  should  be  zero  when 

J  K  J  K 

6(1.,  I,)  =  0  and,  moreover,  Q(1.,I,)  should  vary  as  6(1.,  I,  )  for  fixed 

J  K  J  K.  J  ^ 

n  and  fixed  row  and  column  totals;  (2)  the  maximum  of  Q(I.,I,  )  should 

J  ^ 

occur  when  I.  is  contained  in  I,  (u  ~  0),  or  I.  is  in  contained  I.  (v  =  0), 

J  k  k  I  —  J  ' 

or  I.  and  I,  give  the  same  class  (u  =  v  =  0);  (3)  the  minimum  of  Q(1.,I,  ) 
J  _  _  J  k 

should  occur  when  1,  is  contained  in  I.  (x  -  0),  or  I.  is  contained  in  I, 

k  J  J  k 

(y  =  0),  or  I.  is  the  complement  of  L  (x  =  y  =  0);  (4)  it  should  have  a 
J 


have  a  simple  range  of  values,  say  from  -  1  to  1 .  A  coefficient  that 
has  all  of  these  properties  is: 

Q(Ij.Ik)  =  ■  uv)/(xy  +  uv).  (12) 

(The  intimate  connection  with  6  is  indicated  by  the  fact  that  the  numer¬ 
ator  of  Q  is  n6.  ) 


The  generation  of  a  heuristic  now  proceeds  by  the  plan  of  section  2,  5. 

Given  R  =  I.  we  select  the  term  L  (different  from  I.)  with  the  maximum 
J  ^  J 

coefficient  Q(I.,Ij^).  This  value  will  be  between  0  and  1  or  no  term 

will  be  selected.  Then  R  is  extended  to 


R  =  I.  V  I, 
J 


and  in  the  search  computation  we  multiply  the  weight  w.(l,  )by  Q(I.,I,  ), 

Ik  J  k 


2 .  7  Heuristics  in  the  Document  Space 

It  seems  that  the  modification  of  the  selection  process  by  means  of  a 
concept  of  closeness,  or  distance,  in  the  request  space  holds  the  best 
promise  for  the  generation  of  satisfactory  heuristics.  However,  so 
that  no  possibility  is  overlooked,  we  now  examine  other  notions  of 
distance.  Turning  our  attention  to  the  modification  by  distance  no¬ 
tions  in  the  document  space  (equation  4,  section  2.2)  we  see  that  the 
procedure  is  to  go  from  the  given  request  R  to  C,  the  class  of  docu¬ 
ments  retrieved  by  the  basic  selection  process  f.  We  then  obtain  C" 
by  applying  the  distance  function  in  the  document  space  to  C.  In  a 
sense  then,  this  two  step  procedure  uses  C  as  a  representation  of  R 


The  coefficient  recommended  by  Yule  loc .  cit.  is  not  Q,  but 
Z  =  (\j  xy  -  \j  uv)  /  \J  xy  +  nJ  uv)  . 


The  range  of  variation  of  both  Q  and  Z  is  the  same  and  since  both  lead 
to  equivalent  heuristics  we  have  chosen  Q  for  its  computational  sim¬ 
plicity.  For  refined  work  we  might  adopt  Z. 
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and  extends  this  representation.  We  distinguish  such  heuristics  from 
those  discussed  previously  which  extend  R  directly.  We  call  these  two 
ways  of  looking  at  R,  the  extensional  and  intensional  interpretations. 

This  situation  is  clarified  if  we  look  at  the  probabilistic  matrix; 


'2 

I 

m 

R 

'^ii 

^^12 

.  .  . 

w  . 

Im 

Wj(R) 

'^21 

^^22 

•  •  • 

w  « 

2m 

W2(R) 

• 

• 

• 

.  .  . 

• 

D 

n 

.  .  . 

w 

nm 

Wn(R) 

We  have  adjoined  to  the  matrix  a  column  giving  the  w -values  for  a 
given  request  R.  Since  the  non-zero  values  in  this  column  characterize 
the  documents  that  are  retrieved  this  represents  the  extensional  in¬ 
terpretation  of  R.  The  values  themselves  in  the  R-column  can  be 
thought  of  as  a  measure  of  closeness  between  R  and  the  documents. 

The  matrix  j  itself  gives  a  representation  of  the  document  space 

and  the  index  space.  To  get  the  intensional  interpretation  of  R  into  the 
schema  we  use  the  following  device:  Write  R  in  so-called  "distinguished 
disjunctive  normal  form" --this  is  a  disjunction  of  conjunctions  in  which 
either  T  or  its  negation  occurs.  For  example,  in  the  space  of  three 
tags  1^,1^,!^  the  request  R  =  •  I2  can  be  written 


R  = 


Il.I,.l3  V 


Ij.I2.I3 


-  45 


Having  done  this,  we  can  represent  each  conjunction  that  occurs  in  R 
th 

by  a  vector  whose  j  component  is  1  if  1^  is  in  the  conjunction,  and 
0  otherwise.  Thus  R  is  represented  by  a  bundle  of  such  vectors  and 
we  have  a  matrix  whose  rows  are  these  vectors: 


1 1 

®12 

®  Im 

21 

®22  •  • 

^  2m 

s  1 

%2 

.  e 

sm 

where  s  is  the  number  of  conjunctions  that  occur  in  the  normal  form. 
We  now  adjoin  this  matrix  to  the  probabilistic  matrix: 


'^11 

'^Im 

'^22  •  • 

'^Zm 

'^n2  •  • 

.  w 

nm 

®11 

®12 

e  . 

Im 

®sl 

®s2  ■  • 

e 

sm 

intensional  rep¬ 
resentation  of  r 


A  simple  heuristic  can  be  generated  from  this  schema  by  replacing 
the  non-zero  w-values  by  I's,  thus  obtaining  the  binary  library  matrix. 
The  square  of  the  Pythagorean  distance  between  rows  gives  the  num¬ 
ber  of  positions  in  which  the  rows  differ.  We  say  two  rows  are  0-away, 
1-away,  etc.  ,  ^  depending  on  the  value  of  the  square  distance.  We  see 


The  notion  of  distance  between  documents  is  analogous  to  the  notion 
of  distance  between  codes  as  discussed  in  the  theory  of  error  correct¬ 
ing  codes . 


-  46 
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that  R  will  retrieve  document  if  and  only  if  there  is  a  row  in  B  that 
is  0-away  from  the  row  of  the  binary  matrix.  (In  fact,  there  will 
be  at  most  one  such  row.  )  In  this  case  we  say  that  is  0-away  from 
R.  We  can  enlarge  the  class  of  retrieved  documents  by  considering 
also  documents  that  are  1-away  from  R,  2-away,  etc. 

Enlarging  the  class  by  this  method  is  not  completely  satisfactory.  We 
would  really  like  to  introduce  these  notions  into  a  (generalized)  rele¬ 
vance  number  computation.  That  is  to  say,  we  would  like  to  combine 
heuristics  in  such  a  way  that  documents  with  associated  ranking  num¬ 
bers  are  retrieved,  not  just  classes  of  documents.  We  would  also  like 
to  use  the  values  w^(R)  in  the  computation. 

First  we  note  that  the  Pythagorean  distance  between  two  rows  of  the 
probabilistic  matrix  gives  a  measure  of  dissimilarity  of  information 
content  (as  well  as  dissimilarity  of  distribution  of  information)  between 
documents  corresponding  to  these  rows.  Call  this  distance  '' A  (Dj, ,  D^)”. 
We  can  use  this  distance  function  to  compute  the  distance  of  any  docu¬ 
ment  from  the  class  C  of  documents  retrieved  by  the  basic  selection 
process.  This  is  all  the  theory  required  to  implement  formula  (4) 
(section  2.2). 

Next  is  the  problem  of  computing  the  (generalized)  relevance  number. 
There  are  infinitely  many  possibilities  here  and  which  is  "best"  is 
still  an  open  problem.  However,  an  extremely  natural  one  arises  as 
follows:  We  have  pointed  out  that  the  values  w^{R)  in  the  R-column  of 
the  probabilistic  matrix  can  be  considered  as  a  measure  of  closeness 


Another  measure  of  dissimilarity  is  to  take,  not  the  Pythagorean  dis¬ 
tance  but  the  sum  of  the  absolute  values  of  the  differences  between  cor¬ 
responding  entries;  i.e.,  2|'^ik  ~  '^’k  I 
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between  R  and  the  documents.  To  combine  these  values  withA(Dj^,  D.) 
we  convert  closeness  to  "distance"  by  some  device  such  as  consider¬ 
ing  the  negative  of  the  logarithm  of  Wj^(R).  We  define 


d{R,D.) 


-  log  w^(R). 


(13) 


will  be  retrieved  by  R  if  and  only  if  d(R,  D^)  is  finite;  thus  this  char¬ 
acterizes  the  class  of  retrieved  documents.  Now  take  that  document 

D.  in  the  c 
1 

minimum. 


lass  of  retrieved  documents  such  thatA(D.,D.)  is  a 

1  ^  y 


Then  we  take 


g(R.Dj) 


A(Di.Dj) 


+  log 


'w.(R) 


as  the  measure  of  "distance"^  between  R  and  D.. 

^  J 

retrieved  document,  thenA(Dj^,  D.)  is  zero  and 


(14) 

Note  that  if  D.  is  a 
J 


g(R,  Dj)  =  -  log  Wj(R). 


Furthermore,  if  has  not  been  retrieved  (initially) 


g(R,  D.)  >  -  log  w,(R), 

where  i  is  the  accession  number  of  the  document  nearest  to  D..  Thus 

J 

the  ranking  by  the  g -function  will  always  put  an  adjoined  document  be¬ 
low  its  associated  document  in  the  class  C.  We  may  now  finish  the 
computation  by  subtracting  the  logarithm  of  the  a  priori  probability 
of  a  document  from  its  g-value.  (Analogous  to  multiplying  Wj^(R)  by 
P(A,  Dj^)  to  obtain  the  relevance  number.  )  The  final  heuristic  then,  is 
to  choose  a  suitable  cut  off  point  in  the  list  of  adjoined  documents- - 
taking  only  those  with  (generalized)  relevance  number  less  than  some 
specified  value. 


If  D.  is  not  unique  choose  the  one  in  the  minimal  set  with  the  largest 
w.(R)\ 

2 

Again  (see  note  1,  p  47)  it  might  be  preferable  to  take 
g(R.  D.)  =  A(D.,D.)  +  d(R,D.) 
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Although  simple  in  theory  the  above  heuristic  leads  to  laborious  com¬ 
putations.  Considerable  simplication  results  if  we  restrict  all  com¬ 
putations  to  the  columns  of  the  probabilistic  matrix  corresponding  to 
those  index  terms  mentioned  in  R. 

2 .  8  Further  Remarks  Concerning  Search  Strategies 

We  have  presented  some  of  the  heuristics  that  appear  to  have  the  best 
possiblity  of  being  useful  components  of  a  search  strategy.  We  also 
have  formulated  some  principles  for  a  general  approach  to  the  prob¬ 
lem  of  automatic  elaboration  of  the  selection  process.  Let  us  now 
illustrate  these  ideas  by  constructing  an  over -all  search  strategy. 

First  we  list  the  variables  involved; 

1 .  Input 

(a)  The  request  R 

(b)  The  request  weights 

2.  The  Probabilistic  Matrix  [~w^j  [ 

(a)  Similarity  measures  between  documents,  (e.g.  ,  A -values) 

(b)  Significance  measures  for  index  terms  (An  index  term 
applied  to  every  document  in  the  library  will  have  no  sig¬ 
nificance,  while  an  index  term  applied  to  only  one  docu¬ 
ment  will  be  highly  significant.  Thus  significance 
measures  are  related  to  the  "extension  number"  for  each 
term;  i.e.  ,  to  the  number  of  documents  tagged  with  the 
term--the  smaller  this  number,  the  greater  the  signifi¬ 
cance  of  the  index  term.  ) 

(c)  "Closeness"  measures  between  index  terms  (e.  g.  ,  Q- 
values) 

3.  The  A  Priori  Probability  Distribution 

4.  Output  (by  means  of  the  basic  selection  process;  i.e.  ,  the 
logical  match  plus  Bayes'  schema  with  all  of  its  ramifications  and 
refinements) . 
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(a)  The  class  of  retrieved  documents;  call  this  'C'. 

(b)  n,  the  number  of  documents  in  C. 

(c)  Relevance  numbers. 

5.  Control  Numbers 

(a)  n  ,  the  maximum  number  of  documents  that  we  wish  to 

o 

retrieve . 

(b)  Relevance  number  control;  e.  g.  ,  we  may  ignore  docu¬ 
ments  with  relevance  number  less  than  a  specified  value. 

(c)  Generalized  relevance  number  control.  (Similar  to  the 
above  but  this  applies  to  the  computation  described  in 
section  2.7) 

(d)  Request  weight  control;  i.e.  ,  we  elaborate  on  index  terms 
in  the  request  if  their  request  weight  is  higher  than  some 
specified  value. 

(e)  Significance  number  of  index  term  control;  i.e.  ,  we  give 
index  terms  of  certain  significance  (defined  in  terms  of 
their  extensions)  special  attention. 

6.  Operations 

(a)  Basic  selection  process,  denote  this  by  "f". 

(b)  Elaboration  of  the  request  by  using  ''closeness'' in  the 
request  space.  Denote  this  by  "H".  Thus  the  operation 
H  will  transform  the  request  R  into  a  new  request  R' . 

More  precisely  H  is  the  heuristic:  elaborating  the  index 
terms  in  R  with  request  weights  greater  than  the  request 
weight  control  number  and/or  index  term  significance 
greater  than  a  specified  value. 

(c)  Adjoining  new  documents  to  the  class  of  retrieved  docu¬ 
ments  by  using  "distance"  in  the  document  space.  De¬ 
note  this  by  "h".  Thus  the  operation  h  will  transform  the 
class  C  of  retrieved  documents  into  a  new  class,  say  D. 
Move  precisely,  h  is  the  heuristic:  trim  C  to  documents 
having  relevance  number  greater  than  the  control  number 
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TRIM  C  TO  n-  AND  OUTPUT 


DIAGRAM  3 

A  SEARCH  STRATEGY  BASED  ON  HEURISTICS  OF  ELABORATION  AND  EXTENSION 


and  then  annex  to  C  all  of  the  documents  with  generalized 
relevance  number  in  a  certain  range. 

(d)  Merge:  any  merging  operation  between  two  classes; 

e.g.  ,  forming  their  intersection,  their  union,  trimming 
by  using  relevance  number  and  then  forming  union,  etc. 

Next  we  combine  these  to  obtain  the  strategy  shown  in  diagram  3. 

This  strategy  is  to  be  regarded  as  a  particularly  simple  example,  its 
goal  to  obtain  a  specified  number  of  documents  (say  n^)  having  the 
best  chance  of  satisfying  the  request.  Thus  the  decision  to  elaborate 
centers  on  answering  the  question:  Is  the  number  of  documents  se¬ 
lected  greater  than  or  equal  to  Iri  the  diagram  we  refer  to  the 

heuristic  H  as  simply  "elaborate  the  request".  The  actual  transfer 
function  H  involves  using  control  numbers  to  limit  the  elaboration. 
Furthermore  these  control  numbers  can  be  varied  from  one  applica¬ 
tion  of  H  to  the  next.  Similarly  we  refer  to  the  heuristic  h  as  simply 
"extend  the  class";  but  we  point  out  that  this  too  involves  control  num¬ 
bers.  Finally  a  word  about  the  classes  C,  C,  D,  etc.  These  are 
actually  lists  of  documents  ranked  by  relevance  numbers.  Thus  the 

instruction  "trim  C  to  n  "  means  "cut/off  the  list  to  the  n  documents 

o  o 

with  highest  relevance  number".  The  output  of  the  system  will  be  an 
ordered  list  of  document  accession  numbers. 
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PART  III. 


THE  EXPERMENTAL  RESULTS 
(SUMMARY) 


In.  Part  III  we  describe  some  experiments  that  were  designed  and 
executed  in  order  to  provide  data  for  evaluating  the  effectiveness 
of  the  techniques  of  Probabilistic  Indexing.  The  discussion  of 
Part  II  indicates  that  there  are  two  basic  hypotheses  that  we  wish 
to  verify.  The  first  hypothesis  asserts  that  the  relevance  number 
that  we  compute  for  each  document,  given  a  request  is,  in  fact,  a 
measure  of  the  probable  relevance  of  the  document.  The  second 
hypothesis  asserts  that  the  automatic  elaboration  of  the  selection 
process  does,  in  fact,  produce  relevant  documents  which  are  not 
selected  by  the  original  request. 

Section  1  discusses  the  experimental  set-up;  i.e.  ,  the  library, 
the  indexing  system,  the  weights,  the  testing  procedure,  etc.  Sec¬ 
tion  2  provides  the  data  and  discussion  in  support  of  the  hypothesis 
concerning  the  relevance  function  and  we  find  that  the  results  do 
support  the  hypothesis.  Section  3  provides  the  data  and  discussion 
in  support  of  the  hypothesis  concerning  the  selection  process  and 
we  find  that  the  results  do  support  the  hypothesis. 
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1.  THE  EXPERIMENTAL,  SET-UP 


1 .  1  Initial  Remarks 

The  jumping  off  point  for  our  approach  to  automatic  information  retrieval 
was  the  recognition  that  the  core  of  the  problem  is  that  of  adequately 
identifying  the  information  content  of  documentary  data.  There  is  an  un¬ 
certainty  in  the  relationship  between  the  tags  that  are  used  to  index  docu¬ 
ments  and  the  subjects  that  they  denote  and  this  is  the  cause  of  inadequate 
retrieval  of  desired  information.  Using  the  analogy  of  going  from  an  in¬ 
coming  document  to  its  set  of  index  tags  as  going  from  a  selected  mes¬ 
sage  to  a  received  message  over  a  noisy  channel,  we  recognized  the  ef¬ 
fect  of  semantic  noise  and  accordingly  a  technique  for  handling  it  statis¬ 
tically.  Given  this  analogy  the  problem  was  to  select  the  proper  schema 
from  the  calculus  of  probability  to  allow  for  the  inverse  inference  from 
requested  index  terms  to  most  probably  relevant  document.  This  line  of 
reasoning  thus  led  us  to  the  notion  of  weighting  index  tags  and  using  Bayes' 
Theorem  to  provide  a  function  to  measure  the  degree  of  relevance  between 
an  arbitrary  request  and  any  of  the  documents  selected  by  the  request. 
Further  analysis  led  us  to  the  notion  of  automatically  elaborating  upon 
the  request  in  the  most  probable  direction  so  as  to  improve  the  selection 
of  relevant  documents. 

Given  the  fundamental  notions  of  Probabilistic  Indexing  and  a  logico- 
mathematical  explication  (presented  in  Part  II)  with  w'hich  to  back  up  our 
intuitive  understanding  of  the  problem,  let  us  now  raise  the  question  of 
justification.  That  is  to  say,  to  what  extent  does  our  probabilistic  ana¬ 
lyses  of  the  library  problem  guarantee  that  retrieval  effectiveness  will 
be  improved.  Clearly,  the  only  real  justification  for  Probabilistic  In¬ 
dexing  is  success;  i.e.,  if  the  technique  improves  retrieval  effectiveness 
then  the  system  is  justified  and  if  not,  not.  Therefore,  given  the  basic 
methods  of  Probabilistic  Indexing  a  natural  next  step  is  to  conduct  some 
actual  library  experiments  in  order  to  measure  its  degree  of  success  in 
improving  retrieval  effectiveness.  In  the  following  sections  we  shall  de¬ 
scribe  the  design,  execution  and  results  of  some  actual  experiments. 
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An  evaluation  of  these  empirical  results  provides  good  evidence  in  favor 
of  the  "theory"  of  Probabilistic  Indexing. 

Let  us  point  out  at  this  time,  that  the  value  of  actual  library  experiments 
goes  beyond  a  mere  campaign  to  evaluate  the  techniques  of  Probabilistic 
Indexing.  Actual  library  experimentation  provides  an  excellent  tool  by 
means  of  which  we  can  refine  and  extend  the  methods  and  techniques  that 
constitute  Probabilistic  Indexing.  We  have  formulated  already  what  we 
feel  are  excellent  approximate  solutions  to  some  of  the  major  problems 
and  these  notions  now  must  be  verified  and  refined,  where  necessary,  on 
the  basis  of  actual  experience.  Just  as  the  physicist  requires  such  tools 
as,  for  example,  a  linear  accelerator  in  order  to  empirically  verify  and 
suggest  new  notions  relative  to  nuclear  physics,  so  also  the  "library 
scientist"  requires  the  counterpart  of  the  linear  accelerator;  viz.  ,  a  li¬ 
brary  with  which  he  can  work  and  control.  Just  as  an  experiment  in  phy¬ 
sics  represents  a  set  of  questions  that  a  physicist  asks  of  Nature,  so  also 
the  library  scientist  needs  an  experimental  library  to  which  he  can  ask 
questions  about  the  nature  of  information  identification  indexing , '  searching, 
etc.  ;  and  thereby  to  obtain  answers  on  the  basis  of  which  to  refine  his 
original  questions  and  provide  insights  into  the  evidence  relative  to  library 
problems  and  their  solutions.  Hopefully,  our  explications  can  be  refined 
so  as  to  provide  us  with  a  good  first  approximation  to  a  fundamental  theory 
of  literature  identification,  indexing,  searching  and  retrieval. 

1 . 2  The  Experimental  Library 

A  collection  of  articles  from  Science  News  Letter^  formed  the  library 
for  our  experiments.  The  Science  News  Letter  is  a  weekly  summary  of 
current  events  in  science  and  the  articles  co^'-er  a  wide  range  of  subjects 
ranging  from  Archaeology  to  Astronomy,  Physics  to  Physiology,  and  Med¬ 
icine  to  Meteorology.  In  previous  tests  made  in  October  of  last  year  we 
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selected  200  articles  at  random  and  this  collection  constituted  the  experi¬ 
mental  library.  For  the  present  tests  we  discarded  90  of  the  original 
200  documents  in  order  to  restrict  the  library  to  articles  dealing  with  the 
Physical  Sciences.  Although  110  is  not  a  large  number  we  feel  that  our 
experimental  library  presented  us  with  most  of  the  relevant  and  basic 
problems  that  would  be  found  in  a  "real"  library  and  yet  still  be  of  a  man¬ 
ageable  size. 

Our  choice  of  articles  from  Science  News  Letter  for  inclusion  in  the  ex¬ 
perimental  library  was  dictated  to  a  large  extent  by  the  fact  that  these 
articles  are  relatively  brief,  pithy,  clearly  written,  interesting,  timely, 
uncomplex  and  easy  to  index  by  non-experts.  This  made  not  only  the  in¬ 
dexing  but  the  subsequent  evaluation  of  retrieval  documents  a  reasonably 
uncomplicated  task.  Since  this  experimental  library  wa.s  one  without  a 
previous  history  of  usage  there  were  no  statistics  on  the  a  priori  prob¬ 
ability  of  document  usage  and  consequently  these  statistics  had  to  be  simu¬ 
lated  as  discussed  in  Part  II,  1.  12.  Graph  1  shows  the  simulated  non¬ 
linear  distribution. 

1 .  3  The  Indexing  Syste m 

The  indexing  system,  again,  refers  to  the  class  of  tags  that  are  used  to 
identify  both  the  content  of  the  documents  and  the  requests  and  thus  it  is 
the  language  common  to  both  "sides"  of  the  library.  Since  the  methods 
of  Probabilistic  Indexing  are  applicable  to  any  indexing  system  we  were 
not  limited  in  our  choice  of  a  set  of  tags  to  be  used  for  the  experimental 
library.  The  only  constraint  was  that  the  number  of  tags  in  the  index  list 
be  comparable  with  the  size  of  the  library.  Instead  of  "truncating"  an 
existing  index  system  and  using  its  tags  to  index  the  documents  of  the  ex¬ 
perimental  library,  we  adopted  the  following  procedure:  Each  document 
of  the  library  was  read  and  the  key  content  bearing  words  were  selected 
and  listed.  There  were  a  total  of  577  different  keywords  in  the  list.  These 
words  were  sorted  into  categories  on  the  basis  of  their  meanings.  It 
turned  out  that  the  keywords  (as  they  are  called)  could  be  sorted  into  47 
fairly  well-defined  categories.  In  many  cases  a  particular  keyword  would 
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belong  to  more  than  one  category,  consequently  there  were  919  occur¬ 
rences  of  the  keywords  in  the  47  categories.  The  names  of  the  47  cat¬ 
egories  are  listed  in  Table  2  and  these  names  became  the  tags  that  con¬ 
stituted  the  index  term  list  for  the  experimental  library. 

These  47  index  terms  were  then  assigned  to  the  documents  by  working 
backwards  as  follows:  For  each  category  we  determined  which  keywords 
it  contained  and  each  document  which  contained  the  keyword  in  question, 
was  coordinated  in  the  corresponding  category.  That  is  to  say,  given  the 
categories,  the  keywords  in  each  category  and  the  documents  associated 
with  each  of  the  keywords,  we  then  were  able  to  determine  which  docu¬ 
ments  should  be  coordinated  with  each  category  and  thus  the  documents 
were  indexed  by  assigning  to  each  the  names  of  the  corresponding  cat¬ 
egories.  This  is  clarified  in  Diagram  4  which  shows  the  relationships 
between  documents  and  keyw'ords,  keywords  and  categories,  and,  there¬ 
fore,  documents  and  categories.  Table  3  shows  the  number  of  keywords 
that  were  associated  with  each  of  the  47  categories  and  also  the  number 
of  documents  associated  with  each  of  the  47  categories.  Graph  2  shows 
the  distribution  of  the  frequency  with  which  the  index  terms  were  used, 
plotted  against  their  rank. 

1 . 4  The  Assignment  of  Weights 

Having  assigned  the  index  terms  to  the  documents,  Probabilistic  Indexing 
requires  that  we  indicate  the  degree  with  which  each  tag  holds  for  the  doc¬ 
ument  by  assigning  weights  to  the  index  terms.  In  order  to  assign  the 
corresponding  weight  each  document  was  reread  and  then  the  indexer  de¬ 
cided  for  each  of  the  tags  coordinated  to  each  document,  the  degree  with 
which  it  held  We  had  decided  previously  that  a  reasonable  range  of  values 
for  the  weights  was  eight,  ranging  from  1/8  to  8/8.  In  order  to  aid  the 
indexer  in  obtaining  a  consistent  assignment  of  weights,  rough  weighting 
rules  were  formulated  and  these  are  shown  in  Table  4.  Table  5  shows 
the  distribution  of  weights  for  each  of  the  47  index  terms.  A  portion  of 
the  Probabilistic  Library  matrix  is  shown  in  the  Table  p,  65. 


1. 

Aerodynamics  and  Aviation 

25. 

Mathematic  s 

2. 

Agriculture 

26. 

Measurement 

3. 

Animals  (including  birds, 

27. 

Missiles  and  Rockets 

fish,  and  reptiles) 

28. 

Mystery,  Myths  and  Problems 

4. 

Archaeology 

29. 

Nature 

5. 

Astronomy 

30. 

Navigation 

6. 

Atmosphere 

31. 

Paleontology 

7. 

Atomic  Physics 

32. 

Physical  Quantities 

8. 

Biology 

33. 

Physics 

9. 

Chemistry 

34. 

Plants 

10. 

Communications 

35. 

Political  or  government  groups 

11. 

Computers 

or  functions 

12. 

Defense  and  Warfare 

36. 

Power 

13. 

Electronic  s 

37. 

Predictions 

14. 

Enginee  ring 

38. 

Psychology 

15. 

Engines 

39. 

Research 

16. 

F  ood 

40. 

Satellite  s 

17. 

Geography 

41. 

Social  Sciences 

18. 

Geology 

42. 

Space  Travel 

19. 

Geophysics 

43. 

Teaching  -  Education 

20. 

Health  and  Safety 

44. 

Time 

21. 

History 

45. 

T  ools 

22. 

Machine  ry 

46. 

T  ransportation 

23. 

Man 

47. 

Weather 

24. 

Mate  rials 

TABLE  2 

INDEX  TERMS 
DERIVED  FROM  KEYWORDS 
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1  Aerodynamics  and  Aviation 

2  Agriculture 

3  Animals  (including  birds, 

fish,  and  reptiles) 

4  Archaeology 

5  Astronomy 

6  Atmosphere 

7  Atomic  Physics 

8  Biology 

9  Chemistry 

10  Communications 

11  Computers 

12  Defense  and  Warfare 

13  Electronics 

14  Engineering 

15  Engines 

16  Food 

17  Geography 

18  Geology 

19  Geophysics 

20  Health  and  Safety 

21  History 

22  Machinery 

23  Man 

24  Materials 
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17 

30 

8 

12 

13 

27 

8 

12 

13 

29 

13 

22 

14 

35 

11 

19 

15 

23 

10 

25 

7 

12 

11 

21 

14 

16 

11 

14 

8 

15 

7 

11 

21 

36 

8 

12 

10 

11 

13 

15 

10 

21 

24 

29 

20 

29 

24 

38 

25  Mathematics 

26  Measurement 

27  Missiles  and  Rockets 

28  Mystery,  Myths  and 

Problems 

29  Nature 

30  Navigation 

31  Paleontology 

32  Physical  Quantities 

33  Physics 

34  Plants 

35  Political  or  government 

groups  or  functions 

36  Power 

37  Predictions 

38  Psychology 

39  Research 

40  Satellites 

41  Social  Sciences 

42  Space  Travel 

43  Teaching  -  Education 

44  Time 

45  Tools 

46  Transportation 

47  Weather 
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GRAPH  2 


OiSTRiBUTiON  OF  FREQUENCY  OF 
USE  OF  INDEX  TERMS  IN  THE 
PROBABILISTIC  MATRIX 
(FREQUENCY  OF  USE  VS  INDEX 
TERM  RANK) 
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INDEX  TERM  RANK 


WEIGHT 

DESCRIPTION 

WHEN  USED 

8/8 

Major  Subject 

The  term  is  highly  specific 
and.  covers  an  entire  major 
subject  of  the  document. 

7/8 

Major  Subject 

The  term  is  specific  and  cov¬ 
ers  most  of  a  major  subject 
of  the  document. 

6/8 

More  Generic  Subject 

The  term  is  too  broad  and 
covers  a  major  subject. 

5/8 

Other  Important 

Terms 

Terms  that  would  be  used  in 
a  binary  indexing  but  not  a 
major  subject. 

4/8 

Less  Generic  Subject 

The  term  relates  to  but  is  too 
narrow  to  cover  a  major 
subject . 

3/8 

Minor  Subject 

Includes  such  terms  as  relate 
to  results  of  experiments,  in¬ 
termediate  methods,  possible 
uses,  etc. 

2/8 

Other  Subjects 

Other  relevant  tags. 

1/8 

Barely  relevant 

Subjects  classifier  would  not 

want  to  use  but  feels  that  sonae 
users  might  consider  them 
relevant . 


TABLE  4 


A  GUIDE  FOR  THE  ASSIGNMENT  OF  WEIGHTS 
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WEIGHTS 


8/8 

7/8 

6/8 

5/8 

4/8 

3/8 

2/8 

1/8 

Totals 

1. 

Aerodynamics  and  Aviation 

1 

5 

3 

1 

2 

3 

2 

0 

17 

2. 

Agriculture 

1 

2 

2 

0 

2 

1 

0 

0 

8 

3. 

Animals 

1 

3 

6 

1 

3 

1 

1 

1 

13 

4. 

Archaeology 

1 

2 

4 

0 

0 

0 

0 

1 

8 

5. 

Astronomy 

3 

4 

1 

4 

i 

1 

0 

0 

12 

6. 

Atmosphere 

0 

1 

2 

4 

1 

2 

4 

0 

14 

7, 

Atomic  Physics 

3 

3 

3 

0 

2 

2 

1 

0 

14 

8. 

Biology 

0 

3 

5 

0 

3 

0 

0 

0 

1 1 

9. 

Chemistry 

3 

1 

4 

1 

2 

2 

2 

0 

15 

10. 

Communications 

3 

2 

2 

1 

O' 

1 

0 

1 

10 

11. 

Computers 

2 

1 

2 

1 

0 

1 

0 

0 

7 

12. 

Defense  and  Warfare 

3 

1 

1 

0 

1 

2 

0 

3 

1 1 

13. 

Electronics 

0 

1 

2 

1 

3 

4 

3 

0 

14 

14. 

Engineering 

0 

2 

4 

1 

1 

1 

0 

2 

11 

15. 

Engine  s 

2 

3 

0 

0 

2 

0 

1 

0 

8 

16. 

F  ood 

0 

1 

3 

1 

1 

1 

0 

0 

7 

17. 

Geography 

0 

1 

0 

4 

2 

5 

4 

5 

21 

18. 

Geology 

1 

2 

1 

0 

0 

2 

0 

0 

8 

19. 

Geophysic  s 

0 

2 

1 

0 

2 

0 

1 

4 

10 

20. 

Health  and  Safety 

0 

2 

3 

1 

4 

1 

0 

2 

13 

21.  . 

History 

3 

0 

4 

1 

1 

1 

0 

0 

10 

22. 

Machinery 

1 

0 

6 

1 

3 

6 

4 

3 

24 

23. 

Man 

0 

2 

4 

3 

0 

3 

5 

3 

20 

24. 

Materials 

3 

5 

3 

2 

3 

5 

2 

1 

24 

25. 

Mathematic  s 

0 

0 

3 

2 

0 

2 

0 

0 

7 

26. 

Measurement 

0 

1 

1 

0 

1 

2 

2 

1 

8 

27. 

Missiles  and  Rockets 

2 

1 

1 

2 

0 

2 

1 

0 

9 

28. 

Mystery,  Myths  and  Problems 

0 

0 

1 

2 

1 

2 

1 

0 

7 

29. 

Nature 

0 

1 

4 

1 

2 

0 

1 

0 

9 

30. 

Navigation 

2 

1 

0 

0 

0 

2 

1 

1 

7 

31. 

Paleontology 

3 

1 

0 

0 

1 

0 

0 

0 

5 

32. 

Physical  Quantities 

0 

1 

2 

1 

0 

4 

7 

1 

16 

33. 

Physic  s 

0 

0 

4 

0 

1 

6 

5 

3 

19 

34. 

Plants 

6 

4 

0 

0 

0 

1 

0 

0 

1 1 

35. 

Political  or  Government  etc. 

1 

2 

2 

3 

2 

4 

4 

9 

27 

36. 

Power 

2 

2 

1 

4 

3 

0 

1 

0 

13 

37. 

Predictions 

1 

0 

1 

2 

0 

0 

1 

0 

5 

38. 

Psychology 

3 

0 

2 

1 

0 

0 

0 

0 

6 

39. 

Research 

1 

0 

2 

2 

0 

0 

2 

0 

7 

40. 

Satellites 

3 

4 

3 

0 

0 

0 

0 

0 

10 

41. 

Social  Sciences 

2 

2 

3 

1 

0 

0 

0 

0 

8 

42. 

Space  Travel 

2 

0 

1 

0 

0 

2 

2 

0 

7 

43. 

Teaching  -  Education 

2 

1 

1 

1 

0 

0 

0 

0 

5 

44. 

Time 

0 

1 

0 

0 

0 

1 

4 

1 

7 

45. 

Tools 

0 

1 

0 

1 

1 

1 

2 

0 

6 

46. 

T  ransportation 

1 

2 

2 

2 

3 

2 

2 

2 

16 

47. 

Weather 

6 

1 

0 

_± 

_1 

_0 

_0 

_0 

_ 9 

Totals 

65  73  103 

TABLE  5 

51 

55 

76 

66 

46 

535 
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A  PORTION  OF  THE  PROBABILISTIC  LIBRARY  MATRIX 


Given  the  47  index  terms  we  computed  all  the  conditional  probabilities; 
i.  e.  ,  the  probabilities  that  given  a  document  tagged  with  L ,  it  will  be 
tagged  also  with  Ij^.  This  conditional  probability  matrix  is  shown  in  the 
Table  p.  67.  The  Table  p.  68  lists  for  each  of  the  index  terms  the  term 
for  which  it  has  the  highest  forward  conditional  probability;  i.  e.  ,  for 
each  Ij  it  shows  that  which  makes  a  maximum.  In  order  to 

show  more  graphically  the  connections  between  these  terms  we  have  in¬ 
cluded  the  "map"  in  Diagram  5.  The  Table  p.  70  shows  the  most  highly 
correlated  inverse  conditional  probabilities;  i.  e.  ,  for  each  D  it  shows 

that  I,  which  makes  P{I,  ,1.)  a  maximum.  The  Table  p.  7  1  is  the  matrix 
K  ^  J 

which  shows  the  coefficients  of  association  that  each  index  term  has  for 
every  other  term  and  in  the  Table  p.  7  2  the  most  highly  correlated  terms 
(in  the  sense  of  coefficient  of  association)  are  shown  for  each  of  the  47 
index  terms . 


1.  5  The  Testing  Procedure 

Our  procedure  for  empirically  testing  the  notions  of  Probabilistic  Index¬ 
ing  was  rather  straightforward  and  unsophisticated  and  can  be  described 
as  follows-  At  random  we  selected  34  documents  from  the  experimental 
library  and  from  these  "selected"  documents,  we  formulated  40  different 
questions.  These  questions  were  not  so  narrow  that  they  demanded  some 
specific  piece  of  data  for  the  proper  answer,  but,  rather,  they  were 
more  general  and  of  the  type;  "Find  information  on  the  use  of  atmospheric 
energy  for  satellite  propulsion",  rather  than,  "Find  the  specific  gravity 
of  beryllium". 

We  then  chose  five  intelligent  cooperative  technical  members  of  the  Com¬ 
pany  and  asked  them  to  act  as  test  subjects.  Each  was  given  a  list  of  eight 
questions  and  a  list  of  the  library  index  terms.  They  were  briefed  as  to 
the  nature  of  the  library  and  they  were  asked  to  formulate  a  library  re¬ 
quest  for  information  on  each  of  the  eight  questions.  We  requested  that 
they  attempt  to  have  the  "correct"  document  retrieved  and  as  little  else 


artadon.  0.65  25  Mathematics  11  Coraputtrs 


(Forward  Conditionals) 


in  CO  in 


ST 


§  ^ 
9  y 


C  >.  U  bfl  M 
O  ^  d  o» 

o  fc  W  W  «tj 


fl>  >>  J; 

t*  M)  « 

2  5  E 

H  o  S 


a 

•3 


I  ^  S 

H  'd  O 

it  P  h 


>■  u 


<;  s  o  <:  o 


vO  00 


<  w  •<  cu 


•!i*  #sj  -j*  m 


2  h 

CO  P 


5  ‘5! 


S  S  S  2  ^  ft. 


■c 

a 

3 

a 

13 


>. 

ft. 


'3  'it 

:2  9 


(Xl  tn 


H  H  H  H  ^ 


m  ^ 


K 


<M  -f. 


CO  vQ  CO  m 


m  m  CO 


o  o  o  o 


2  H 


V 

nj 

B 


II 


•  •4  n,  rt.  O 

CQ  V3  CO  ft. 


^  U  S  < 


6  So 

o  ca 

o  w 


^1— i-H.^a  0'~*  N  o 

^«)>hi.o«uo2 

•iijd  txw  U,d.i3  o  c 
».uo«>.uBe!>-o 
t30>4VvmMc;v),x} 

<<OoiQ.<WD.U 


^3  P>J  vO 


O 


ao  c  m 

c  «t:u2 


<  < 


I  2 
<:  < 


u 


L)QWWWU.000 


«  jd  ■* 

o  o  1v 

OJ  U  V 

X 


s  s 


B 


70 


as  was  reasonable.  They  were  told  the  meaning  of  the  logical  connec¬ 
tions  (conjunction  and  disjunction)  and  they  were  allowed  to  use  these 
connectives  as  desired.  Having  formulated  a  library  request  for  each 
of  the  eight  questions  they  were  asked  to  repeat  the  process,  the  second 
time  weighting  (as  desired)  the  index  tags  used  in  the  requests.  We  took 
their  library  requests  searched,  computed,  ranked  the  retrieved  documents 
and  evaluated  the  results  with  the  aid  of  the  individual  who  originally 
made  the  requests.  The  results  are  described  in  the  following  sections. 
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2.  THE  MEASURE  OF  RELEVANCE 


2.  1  Initial  Remarks 

In  this  section  we  describe  our  attempts  to  evaluate  empirically  the 
goodness  of  our  function  which  gives  a  relevance  number.  That  is  to 
say,  in  Part  II,  1,  we  explicated  the  notion  of  probable  relevance  via 
Bayes'  Theorem  and  in  the  present  section  we  shall  describe  and  inter¬ 
pret  the  results  of  a  test  to  evaluate  empirically  the  goodness  of  our 
relevance  function.  Thus,  again,  the  problem  which  we  shall  now  con¬ 
sider  is:  "How  well  does  our  function  perform  in  ranking  documents 
according  to  relevance"  ?  Only  in  section  3  shall  we  examine  the  other 
major  question;  viz.  ,  "How  good  is  the  selection  process"? 

2.  2  Some  Clarification 

Let  us  briefly  review  the  notion  of  relevance.  Exactly  what  is  it  we  wish 
to  measure?  First,  we  must  be  careful  to  distinguish  a  user's  informa¬ 
tion  need  N  from  his  request  formulation  R.  In  our  experiments  we  can 
think  of  N  as  the  particular  information  item  desired,  R  as  the  formulation 
of  this  item  in  the  request  language.  Thus,  in  the  example  question  given 
in  Table  12,  p83,  we  have 

N:  Turbojet  engines  for  commercial  air  travel, 

R:  Transportation  and  (Aerodynamic s -Aviation  or  Engines). 

In  a  real  library  system  N  will  never  be  known,  only  its  description  R 
in  a  rather  artifical  language.  The  library  indexing  system  only  relates 
documents  to  this  request  language.  But  we  want  to  relate  documents  to 
information  need.  We  see  then  that  a  bridge  between  request  language 
and  information  need  is  through  statistical  data  relating  library  users 
with  the  utility  they  derive  from  documents.  Such  statistical  data  is 
given  by  the  ^  priori  probability  distribution.  This  is  shown  by  the  theo¬ 
retical  development  which  states  that  the  probability  of  a  document  sat¬ 
isfying  the  request,  i.  e.  ,  the  probability  of  a  document  giving  the  desired 
information  item  N,  is  proportional  to  the  product  of  the  a  priori  prob¬ 
ability  (P(A,  Dj^))  and  the  value  (w.(R))  of  the  extended  weight  function  for 
the  request  describing  that  need.  In  a  sense  then  this  quantity  is  a  measure 
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of  the  degree  of  relevance  of  a  document  for  the  information  need  of  the 
requestor.  We  say  "in  a  sense"  because  of  its  unavoidable  probabilistic 
nature.  It  is  a  probabilistic  estimate  of  the  relevance  of  a  document  for 
the  information  need  of  the  requestor.  With  this  qualification  in  mind  we 
call  this  quantity  "relevance  to  information  need"  or  "probable  relevance". 
We  have; 

(relevance  to  information  need)'~\_/P(A,  D^)- a>^{R)  (1) 

2.  3  An  Experimental  Result 

If  we  examine  again  Table  1  (p.  21)  which  gives  the  various  interpreta¬ 
tions  of  the  request  language  we  see  that  it  might  be  fruitful  to  look  for 
a  quantitative  measure  of  relevance  with  respect  to  the  logical  meaning 
of  the  request;  i.  e.  ,  a  measure  of  "relevance  to  request"  as  opposed  to 
"relevance  to  information  need".  We  conjecture  that  our  computational 
procedure  for  computing  the  values  of  the  extended  weight  function  is 
such  a  measure.  We  will  present  the  supporting  data  in  subsequent  sec¬ 
tions,  but  in  anticipation  of  this  we  state  the  result;  Bayes'  Theorem 
plus  experimental  data  implies; 

(relevance  to  information  need)'^P(A,  D. )•  (relevance  to  (2) 

request  formulation). 

2.  4  The  Result  Predicted  by  Bayes’  Theorem 

The  content  of  Bayes'  Theorem  (formula  (1))  can  be  illustrated  by  the  fol¬ 
lowing  hypothetical  experiment;  Consider  a  document  in  the  experimental 
library.  It  consists  of  many  information  items.  Select  one  of  these. 

Let  a  library  user  formulate  this  item  in  the  library  request  language. 

Let  the  library  system  now  operate  on  the  request,  producing  a  collec¬ 
tion  of  documents.  (If  the  library  indexing  system  is  adequate  and  the 
formulation  of  the  request  is  accurate,  the  original  document  from  which 
the  information  item  was  derived  should  appear  in  this  retrieved  collec¬ 
tion.  )  We  now  ask  the  library  user  to  prepare  a  list; 


'Ll  the  retrieved  documents  ranked  according  to  relevance  to  the 
information  item. 

We  ask  another  person  to  prepare  a  second  list: 

the  retrieved  documents  ranked  according  to  relevance  to  the 
request  formulation. 

To  facilitate  the  processing  of  this  comparative  data  we  ask  that  the  docu¬ 
ments  be  classified  into  five  categories:  I,  Very  Relevant;  II,  Relevant; 
III,  Somewhat  Relevant;  IV,  Only  Slightly  Relevant;  V,  Irrelevant. 

Suppose  now  we  simulate  an  a  priori  probability  distribution,  and  repeat 
the  above  experiment  many  times,  each  time  selecting  a  document  by 
using  the  simulated  distribution.  For  each  request  we  obtain  lists 
and  Li^  and  a  third  list  L-: 

""  C. 

the  retrieved  documents  ranked  according  to  the  magnitude 
P{A,  D.)-w.(R). 

Bayes'  Theorem  now  tells  us  what  we  may  expect  to  find;  namely,  that 
the  list  Li^  will  agree  with  list  in  the  long  run.  More  precisely,  for 
each  request  R  there  are  many  information  items  (or  needs)  that  would 
be  formulated  by  R,  one  for  each  requestor  who  uses  R.  If,  for  each 
list  L  that  originated  with  these  requestors,  we  computed  the  mean 
relevance  evaluation  for  each  document  in  by  using  the  category  num¬ 
bers  I,  II,  III,  IV,  V,  then  the  resulting  ranking  should  agree  with  list 


2.  5  The  Experimental  Design 

The  result  predicted  above  is  difficult  to  test  empirically  because  it  would 
require  such  a  large  sample,  but,  an  experiment  designed  on  a  much 
smaller  scale  can  give  us  some  valuable  information.  Such  an  experi¬ 
ment  is  the  one  described  in  section  1.  5  and  the  following  sections.  Since 
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it  is  designed  primarily  to  test  both  the  basic  selection  process  and  the 
search  strategy  by  elaborating  the  request,  as  well  as  the  computational 
schema  for  tx)^{R),  a  flat  a  priori  probability  distribution  is  assumed,  i.  e.  , 
all  P(A,D.)  are  taken  to  be  equal.  The  significance  of  this  for  the  prob¬ 
able  relevance  concept  is  clear  by  looking  at  formula  (1); 

(relevance  to  information  need)^V^a). (R) .  (3) 

The  interpretation  of  this  by  the  phase  "in  the  long  run"  still  holds  how¬ 
ever  That  is,  we  would  not  expect  a  single  list  of  type  L,  ^  to  compare 
with  its  corresponding  list  of  type  (this  last  being  the  ranking  of  docu¬ 
ments  by  the  values  oj^(R)  in  the  case  of  equal  P(A,D.)).  This  can  be  seen 
by  noting  that  as  the  information  need  becomes  more  specific  the  evalua¬ 
tions  in  a  list  of  type  would  tend  to  split  into  the  two  classifications  of 
Very  Relevant  or  Irrelevant,  but  the  ranking  by  the  values  of  <jiJj(R)  al¬ 
ways  varies  gradually.  On  the  other  hand  a  list  of  type  L,^  r>riight  con¬ 
ceivably  be  expected  to  agree  with  the  list  in  a  single  case.  This  is 
the  content  of  the  experimental  result  stated  in  2.  3.  To  bear  this  out  we 
selected  eight  of  the  40  requests  and  obtained  for  each  of  these  requests 
a  list  of  type  L^.  In  addition  v.  had  control  lists  of  types  L,  ^  and  Li^ 
prepared  for  these  same  requests  (i.  e.  ,  each  evaluation  done  twice  by 
different  persons).  The  lists  of  type  had,  as  expected,  a  more  even 
distribution  of  documents  throughout  the  five  categories  I  (Very  Relevant) 
to  V  (Irrelevant). 

The  processing  of  this  evaluation  data  was  accomplished  on  the  following 
lines-  We  saw  what  comparative  data  was  reflected  in  the  scale  of  values 
cj^{R)  and  compared  this  with  the  evaluations  of  the  lists.  The  details 
of  the  experimental  data  and  its  analysis  are  presented  in  the  follov/ing 
sections  (2.  6-2.  8). 

2.  6  The  Hypotheses  to  be  Tested 

We  can  formulate  our  goal  as  that  of  attempting  to  confirm  that  the  value 
(R)  that  we  compute  for  each  document  selected  by  a  given  request  is. 
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in  fact,  a  measure  of  relevance  with  respect  to  the  request  formulation. 

If  our  basic  notion  is  correct  it  implies  the  following  hypothesis  which 
we  call  H  ^ . 

if  a  document  is  relevant  to  a  request,  then  a  high  number 

cj^(R)  will  be  derived  for  it. 

How  to  verify,  confirm,  test  this  hypothesis  empirically?  We  did  the 
following;  A  number  of  documents  from  our  experimental  library  were 
selected  at  random  and,  for  each  document,  a  question  was  formulated 
which  could  be  answered  by  reading  the  corresponding  document.  Sev¬ 
eral  persons  who  acted  as  test  subjects  were  briefed  as  to  the  nature  of 
the  library,  the  indexing  system,  etc.  ,  given  a  set  of  questions  and  asked 
to  formulate  a  library  request  for  information  on  the  basis  of  which, 
hopefully,  relevant  documents  would  be  retrieved  (so  as  to  answer  the 
question).  Given  the  library  requests  that  these  test  subjects  formulated 
we  proceeded  to  search  and  select  the  accession  numbers  of  those  docu¬ 
ments  satisfying  the  logic  of  the  request.  For  each  request  a  list  of 
documents  (i  e.  ,  a  list  of  the  corresponding  accession  numbers)  was 
generated  and  the  documents  in  the  lists  were  ranked  according  to  the 
number  w.(R)  that  was  computed  for  each.  We  then  examined  each  list 
to  determine  whether  or  not  the  so-called  "answer"  document  was  on 
the  list  and  if  it  was,  we  recorded  its  relative  position  on  the  list.  We 
made  the  (natural)  assumption  that  the  answer  document  (i.  e.  ,  the  docu¬ 
ment  on  the  basis  of  which  the  question  was  formulated)  would  be  rele¬ 
vant  to  the  request.  We  then  determined  the  number  of  times  that  the 
correct  answer  document  was  retrieved  associated  with  a  high  number 
to^(R).  The  results  can  be  summarized  as  follows.  Forty  library  requests 
were  made  and  in  27  cases  the  answer  document  was  retrieved.  The 
number  of  documents  on  the  output  lists  ranged  from  a  minimum  of  1 
(in  four  cases)  to  a  maximum  of  4  1 ,  In  the  majority  of  the  23  cases  which 
contained  more  than  a  single  document,  the  answer  document  appeared 
towards  the  top  of  the  list. 
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The  results  showed  that  if  the  answer  document  was  on  a  list,  then  it 


was  computed  to  have  a  high  number  u)j^(R)  in  most  of  the  cases  .  This  evidence 
thus  supports  the  hypothesis  which  asserts  that  if  a  document  is 

relevant  a  high  value  oo.{R)  will  be  computed  for  it.  However,  it  was  not 
always  the  case  that  the  answer  document  was  computed  to  have  the  high¬ 
est  number;  i.  e.  ,  there  were  documents  other  than  the  answer  document 
for  which  a  high  number  was  derived.  Thus,  the  question  arises:  "If 
a  document  has  a  high  number  a).(R),  is  it  relevant  to  the  request"?  This 
represents  the  converse  of  the  original  hypothesis  H^.  We  shall  form 
this  as  an  hypothesis  and  call  it  H^. 


if  a  document  has  a  high  number  u)j^(R),  then  it  is  relevant  to 
the  corresponding  request. 


If  we  can  confirm  as  well  as  H.  we  will  have,  in  fact,  confirmed  an 
hypothesis  H  which  is  stronger  than  each. 

H  the  methods  of  Probabilistic  Indexing  will  derive  a  high  number 
a)^(R)  for  an  arbitrary  document  Jf  and  only  if  the  document  in 
question  is  relevant  to  the  request. 


In  order  to  determine  if  there  were  relevant  documents,  other  than  the 
answer  document  on  a  list  we  had  to  have  evaluation  data  of  the  type  de¬ 
scribed  in  2,  5.  We  obtained  a  sample  of  this  information  from  the  test 
subjects  in  the  following  way.  Four  of  the  five  test  subjects  were  given 
the  actual  documents  corresponding  to  the  retrieval  lists  and  they  were 
asked  to  read  each  document  and  decide  whether  they  considered  it  to  be 
Very  Relevant,  Relevant,  Somewhat  Relevant.  Only  Slightly  Relevant, 
Irrelevant.  Thus  for  each  document  retrieved  they  would  judge  to  which 
of  these  five  categories  it  belonged  and  we,  in  turn,  compared  their  judg¬ 
ments  with  the  numbers  w^{R)  which  we  had  computed  for  each  document. 
A  fifth  person  prepared  control  lists,  i.  e.  ,  evaluations  for  the  same 
requests . 
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In  order  to  facilitate  the  comparison  we  standardized  the  values  a)j^(R); 

1.  e.  ,  we  multiplied  each  value  by  the  reciprocal  of  the  highest  value  to 
force  the  numbers  on  each  list  to  vary  from  1  to  0  -  0  being  the  value 
assigned  to  unretrieved  documents.  We  also  divided  the  numbers  into 
three  categories;  high  (value  equal  to  or  greater  than  0.  75),  medium 
(value  between  0.  75  and  0.  25)  and  low  (value  equal  to  or  less  than  0.  25). 
The  results  show  quite  definitely  that  if  a  document  has  a  high  number 
a)j^(R)  that  document  was  judged  by  the  evaluator  as  Very  Relevant  or 
Relevant,  in  most  cases.  Conversely,  if  the  number  cl).(R)  was  low  the 
evaluators  rated  the  corresponding  document  as  either  Only  Slightly 
Relevant  or  Irrelevant  in  most  cases. 

Thus  the  data  supports  the  following:  "If  a  document  is  relevant  to  a  re - 
quest,  then  there  is  a  strong  probability  that  the  document  will  have  a 
a  high  number  «^(R)  computed  for  it.  "  Furthermore,  the  data  supports 
the  converse;  viz.  ,  "If  a  document  is  computed  to  have  a  high  number 
w^(R),  there  is  a  strong  probability  that  it  is  relevant  to  the  request". 

Thus  the  data  supports  both  H  and  and  taken  jointly  we  see  that  the 
data  does  support  and  confirm  the  stronger  hypothesis  H  ' ,  viz.  ,  a  high 
number  w^(R)  will  be  derived  if  and  only  if  the  document  in  question  is 
relevant  to  the  request.  The  details  of  the  analysis  are  presented  in  the 
following  section.  -  - 

2.  7  Analysis  of  the  Data 

The  eight  lists  involved  in  the  evaluation  with  respect  to  request  relevance 
had  a  sum  total  of  69  documents.  First  we  examine  how  the  values 
u)j^(R)  associated  with  these  documents  were  distributed  among  the  five 
categories.  Computing  the  average  value  and  the  variance  in  each  of 
the  five  categories,  we  obtained  the  following  results. 
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DOCUMENT  RATING 

MEAN 

VARIANCE 

I. 

Very  Relevant 

0.  81 

0.  043 

II. 

Relevant 

0.72 

0.  053 

III. 

Somewhat  Relevant 

0.  54 

0.  043 

IV. 

Only  Slightly  Relevant 

0.  40 

0.  no 

V. 

Irrelevant 

0.  18 

0.  013 

Thus  we  see  that  the  values  of  the  numbers  that  we  computed  decrease, 
on  the  average,  as  we  go  from  Category  I  (Very  Relevant)  to  Category  V 
(Irrelevant). 

Although  this  result  tends  to  confirm  our  hypotheses  H-,,  H  we  pre¬ 

fer  to  look  deeper  into  the  situation.  Let  us  denote  the  class  of  all  docu¬ 
ments  with  numbers  oj.(R)  greater  or  equal  to  0.75  by  "High";  those  with 
numbers  less  than  or  equal  to  0.  25  by  "Low".  Let  us  also  call  categories 
I  or  II  simply  "Relevant"  and  category  V,  as  before,  "Irrelevant".  Note 
that  Relevant  and  Irrelevant  are  not  negations  of  each  other  since  we 
have  the  intermediate  categories  III  and  IV  consisting  of  documents  neither 
totally  Relevant  nor  Irrelevant.  Now  the  hypotheses  Hj,  H  say 

two  things; 

(1)  Relevant  is  equivalent  to  High, 

(2)  Irrelevant  is  equivalent  to  Low, 

and  imply  two  weaker  statements- 

(3)  Relevant  implies  not-Low, 

(4)  Irrelevant  implies  not -High. 

The  statistical  confirmation  of  these  statements  can  be  accomplished  by 
using  the  theory  of  the  coefficient  of  association  between  predicates  as 
outlined  in  section  2.  6.  That  is  to  say,  each  of  the  statements  above 
calls  for  a  study  of  a  matrix  of  the  kind  defined  on  p.  42;  i.  e.  ,  a  sorting 
of  the  69  documents  according  to  the  properties; 
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(1)  Relevant  and  High, 

(2)  Irrelevant  and  Low, 

(3)  Relevant  and  Low, 

(4)  Irrelevant  and  High. 

We  would  expect  to  find  the  Q-values  in  (1)  and  (2)  to  be  near  +1  (maxi¬ 
mum  positive  association),  the  Q-values  in  (3)  and  (4)  near  -1  (maximum 
negative  association).  These  values  are  in  fact; 

Q  (Relevant,  High)  =+0.  70, 

Q  (Irrelevant,  Low)  =+0.  90, 

Q  (Relevant,  Low)  =  -0,  92, 

Q  (Irrelevant,  High)=-1.00. 

Since  these  values  are  fairly  sensitive  we  introduce  a  control  on  the 
study  by  assuming  that  these  predicates  are  statistically  independent, 
then  computing  the  probabilities  of  the  Q-values  having  been  as  close 
or  closer  to  the  anticipated  values  by  chance.  For  the  four  distributions 
we  calculate  these  control  probabilities  to  be  0.  04 1 ,  0.006,  0.010,  0.059, 
respectively.  ^ 

2.  8  A  Note  on  other  Data 

We  have  still  to  consider  the  weighted  request.  Recall  that  we  have  two 
types  of  inputs  to  consider;  viz.  ,  the  conventional  request  (an  affirma¬ 
tive  Boolean  function  of  the  index  terms)  and  the  weighted  request.  These 
lead  to  two  different  output  lists.  On  page  83  we  include  a  typical  data 
tabulation  sheet  for  one  of  the  questions  that  we  used  in  our  experiments. 
The  conventional  request  in  this  case  was  expressed  formally  by  the 
expression 


That  is  to  say,  if  Relevant  and  High  are  independent  then  the  probability 
of  their  Q-value  having  the  property  0.  70^  Q  ^  1. 00  is  0.  041.  Simi¬ 
larly  for  the  other  classifications. 
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Question:  Turbojet  Engines  for  Commercial  Air  Travel. 

Answer  Document:  92 
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A  DATA  TABULATION  SHEET 


and  the  computation  by  Bayes'  schema  resulted  in  the  listing  shown  in 
column  P.  The  corresponding  weighted  request  was  expressed  formally 
as 

(0.8)1^^.  Qo.3)I^v{0.9)Ii5^ 

and  the  results  are  shown  in  column  W.  For  completeness  we  did  cal¬ 
culations  using  a  simulated  non-linear  a  priori  probability  distribution. 

The  resulting  relevance  numbers  are  shown  in  the  two  columns  labelled 
P'  and  W.  The  answer  document  for  this  particular  question  was  docu¬ 
ment  92  and,  quite  by  chance,  it  appears  in  the  second  position  on  each 
of  these  four  lists. 

The  results  for  the  case  of  the  weighted  request  (flat  a  priori  probability 
distribution  assumed)  do  confirm  the  basic  thesis  which  asserts  that  the 
number  is,  in  fact,  a  measure  of  relevance  with  respect  to  request; 

however,  the  data  are  not  as  confirmatory  as  for  the  case  when  the  re¬ 
quest  is  an  affirmative  Boolean  function.  The  reason  for  this  is  that 
the  evaluations  of  document  relevance  were  oriented  toward  the  unweighted 
request.  ^  However  a  consideration  of  the  variation  of  the  mean  value  of 
Wj^(R)  is  still  of  interest.  Analogous  to  the  table  on  p.  81  we  have: 


DOCUMENT  RATING 

MEAN 

VARIANCE 

I. 

Very  Relevant 

0.  87 

0.  031 

II. 

Relevant 

0.  53 

0.  095 

III. 

Somewhat  Relevant 

0.  39 

0.  076 

IV. 

Only  Slightly  Relevant 

0.  45 

0.  123 

V. 

Irrelevant 

0.  33 

0.  073 

For  example:  An  evaluation  of  document  relevance  with  respect  to  the 
request  "Psychology  or  Teaching"  would  give  quite  different  results  than 
when  evaluated  with  respect  to  the  request  "(.7)  Psychology  or  (.  3)  Teaching". 
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3.  ELABORATION  OF  THE  SELECTION  PROCESS 


3.  1  Initial  Remarks 

The  basic  aim  behind  Probabilistic  Indexing  has  been  the  obvious  one;  viz.  , 
to  improve  retrieval  effectiveness.  The  fundamental  notion  has  been  to 
introduce  arithmetic  (as  opposed  to  logic  alone)  into  the  problem  of  in¬ 
dexing  and  thereby  pave  the  way  for  the  use  of  mathematical  operations 
so  as  to  compute  probable  relevance.  Thus  the  fundamental  notion  which 
acts  as  a  wedge  to  drive  an  opening  into  the  basic  problem  of  retrieval 
effectiveness  is  that  of  the  relevance  number  (as  explicated  in  terms  of 
the  theorem  of  Bayes).  The  relevance  number,  as  we  have  seen,  provides 
a  means  of  ranking  documents  according  to  their  probable  relevance. 
However,  the  solution  to  the  problem  of  retrieval  effectiveness  involves 
more  than  ranking  by  relevance--it  involves  the  proper  selection  of  those 
documents  which  are  to  be  ranked.  Before  we  describe  the  results  of  the 
experiments  that  were  conducted  to  test  our  methods  for  improving  the 
selection  process,  let  us  take  one  more  look  at  the  relevance  number  as 
a  filter  to  eliminate  low  relevance  documents.  In  particular,  let  us  con¬ 
sider  the  usefulness  of  the  relevance  number  on  unelaborated  requests. 

In  our  experiments,  40  different  library  requests  were  made  and  a  total 
of  379  documents  were  retrieved  (usirg  the  basic  process  of  selecting 
those  documents  whose  tags  are  logically  compatible  with  the  logic  and 
tags  of  the  request).  Let  us  compare  the  results  of  probabilistic:  search¬ 
ing  and  so-called  "binary"  or  conventional  searching.  We  can  do  thi.3  by 
assuming  that  all  the  tags  which  are  assigned  to  documents  with  a  non¬ 
zero  weight  are,  in  fact,  assigned  to  the  corresponding  documents  in  the 
conventional  system.  Thus  when  the  basic  selection  process  is  the  same 
(viz.  ,  the  unelaborated  logical  matching  process),  the  same  documents 
will  be  retrieved  in  both  cases;  however,  in  the  conventional  system  the 
retrieved  documents,  are  not  ranked  by  any  criteria  of  relevance.  For 
each  of  the  retrieval  lists  if  n  documents  have  been  retrieved  and  the  an¬ 
swer  document  is  present,  then  using  the  conventional  search  technique 
the  requestor  must  read,  on  the  average  -  ^  ^  documents.  If  the  answer 
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document,  is  not  present,  then  all  of  the  retrieved  documents  must  be  read 
(in  order  to  determine  that  no  relevant  information  was  retrieved).  These 
considerations  (inadequate  though  they  be,  since  they  presuppose  that 
only  an  answer  document  produces  a  satisfactory  search  result)  give  us 
a  criterion  with  which  to  compare  the  probabilistic  and  binary  searches. 

This  criterion  is  the  total  number  of  documents  that  would  have  to  be  read 
for  all  40  searches  in  order  to  find  the  answer  documents.  The  results 
are  as  follows; 

Total  Number  of  Total  Number  of  Documents 

Type  of  Search  Documents  Retrieved  that  would  have  to  be  Read 

Binary  .379  235 

Probabilistic^  379  181 

Thus  we  see  that  a  conventional  system  would  require  the  user  to  read 
approximately  30  percent  more  retrieved  documents  to  obtain  the  same 
number  of  answer  documents.  These  two  different  searches  ,  each  using 
the  basic  selection  process,  produced  27  answer  documents  out  of  a  pos¬ 
sible  40,  (Note  that  the  binary  search  as  defined  above  is  more  extensive 
than  might  be  expected  in  the  sense  that  we  have  used  all  the  tags  with 
non-zero  weights  as  binary  tags.  In  an  actual  conventional  system  those 
tags  with  a  low  weight  would  probably  not  be  coordinated  with  documents. 

That  is  to  say,  the  use  of  weighted  tags  encourages  more  tags  to  be  ap¬ 
plied  to  a  given  document  than  would  be  the  case  if  weights  were  not  al¬ 
lowed.  1ji  a  previous  study  where  documents  were  indexed  independently 
by  two  different  indexers,  one  using  probabilistic  indexing,  the  other  us¬ 
ing  binary  indexing  (i.  e.  ,  either  a  tag  holds  for  a  document  or  it  does  not), 
it  was  found  that  70  percent  more  answer  documents  were  retrieved  in 
the  probabilistic  search  and  only  32  percent  more  documents  had  to  be 
read.  ) 


A  flat  a  priori  probability  distribution  was  used. 
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The  above  comparison  presupposes  that  the  user  is  looking  for  some 
specific  information  (viz.  ,  the  answer  document)  and  that  he  knows  when 
he  has  found  it.  It  might  be  more  realistic  to  make  no  such  assumption; 
therefore,  let  us  consider  the  following  comparison.  Given  a  request  for 
information  a  probabilistic  search  is  made  but,  beforehand,  we  tell  the 
user  to  read  only  those  documents  which  have  a  computed  relevance  num¬ 
ber  greater  than  0.  5.  That  is  to  say,  "before  the  facts"  we  give  the  re¬ 
questors  a  guide  to  use  in  reading  the  40  lists  presented  to  them.  It  turns 
out  that  of  the  379  documents  in  the  40  lists  there  are  only  225  which  have 
a  relevance  number  greater  than  0.  5.  Furthermore,  it  turns  out  that  if 
the  users  had  adopted  the  strategy  of  reading  only  those  retrieved  docu¬ 
ments  which  have  relevance  numbers  greater  than  0.  5,  then  they  would 
have  found  25  of  the  27  answer  documents.  ^  Now  compare  this  with  the 
case  of  conventional  retrieval  where  the  users  would  have  to  read  all  of 
the  379  retrieved  documents  (since  there  is  no  way  to  distinguish  between 
any  two  documents  in  the  same  list).  In  this  latter  case  the  users,  of 
course,  would  find  all  27  answer  documents,  but  again  at  the  "cost"  of 
reading  all  379  documents.  Thus  we  see  that  a  conventional  system  would 
require  that  users  read  68.  5  percent  more  documents  than  for  the  prob¬ 
abilistic  system  and  they  would  gain  only  7.  4  percent  in  increased  num¬ 
ber  of  answer  documents. 

These  considerations  indicate  that  the  relevance  number  can  be  used  to 
filter  out  irrelevant  material.  That  is  to  say,  if  we  use  the  relevance 
number  associated  with  documents  to  separate  the  relevant  from  the  ir¬ 
relevant,  we  are  providing  the  user  with  a  valuable  tool. 

3.  2  The  Automatic  Elaboration 

We  have  described  two  methods  for  automatically  elaborating  upon  the 
selection  process  which  is  involved  in  information  searching.  One  method 

^In  one  of  the  two  remaining  cases  the  relevance  number  of  the  answer 
document  was  just  under  0.  5  and  in  the  other  case  the  answer  document 
had  a  rather  low  number  but  it  was  third  in  a  list  of  only  three. 
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establishes  a  measure  of  closeness  in  document  space  and  the  other 
method  involves  measures  of  closeness  in  request  space.  We  shall  not 
consider  the  former  since,  as  yet  no  experimental  tests  have  been  com¬ 
pleted  which  would  enable  us  to  evaluate  the  notions  of  distance  as  de¬ 
scribed  in  Part  II,  2.  7.  We  measure  closeness  in  request  space  by  de¬ 
termining  certain  statistical  relationships  between  the  index  terms  of  a 
request  and  other  terms.  Specifically,  we  have  described  three  different 
statistical  relationships,  viz.  ,  forward  conditional  probabilities,  inverse 
conditional  probabilities,  and  coefficients  of  association.  We  now  raise 
the  questions;  "How  good  are  the  proposed  statistical  measures  of 
closeness  in  elaborating  upon  a  request?"  and  "Which  of  the  three 
measures  that  have  been  discussed  is  the  best?"  Again,  in  the  case  of 
the  automatically  elaborated  request  we  generate  the  new  request  R' 
given  the  initial  request  R  by  formulating  the  following  type  of  disjunc¬ 
tion  for  each  tag  in  R; 

if  R  =  I.,  then  R'  =1.  v(a)I.' 

J  J  J 

where  a  is  the  measure  of  closeness  between  I.  and  1.'  and  I.'  is  the  term 

J  J  J 

that  gives  maximum  a  with  respect  to  R  We  would  like  to  be  able  to  es¬ 
tablish  the  following. 

1.  That  the  elaborated  request  catches  relevant  documents  which 
are  not  selected  by  the  original  Junelaborated)  request. 

2  That,  although  the  elaborated  request  catches  more  documents, 
the  relevance  number  can  be  used  as  a  guide  for  eliminating 
the  ones  with  low  probable  relevance, 

3.  3  Some  Testing  (Evaluation)  Problems 

Since  we  are  really  interested  in  the  over -all  retrieval  effectiveness  of 
the  selection  process,  we  v/ould  like  to  know  how  many  of  the  relevant 
documents  in  the  entire  library  have  been  caught  by  the  elaborated  re¬ 
quests,  In  order  to  determine  this  it  would  be  necessary  for  us  to  pre¬ 
sent  to  the  requestor  the  entire  library  so  that  he,  in  turn,  could  judge 
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■which  relevant  documents,  if  any,  were  not  retrieved.  That  is  to  say, 
in  order  that  a  user  properly  judge  whether  or  not  he  did,  in  fact,  re¬ 
ceive  all  relevant  documents  as  the  result  of  a  search,  he  would  have 
to  be  familiar  with  the  entire  contents  of  the  library.  Because  of  this 
difficulty,  we  see  that  such  an  evaluation  would  be  impractical  to  con¬ 
duct.  We  must,  therefore,  lower  our  sights  and  look  for  a  substitute 
type  of  evaluation.  The  substitute  that  we  have  adopted  consists  in,  a- 
gain,  using  the  answer  documents  as  a  measure  of  retrieval  effective¬ 
ness.  That  is  to  say,  since  we  know  that  the  answer  documents  are 
relevant,  we  can  automatically  elaborate  upon  those  original  requests 
which  did  not  catch  the  answer  document  in  order  to  see  whether  the 
elaborated  request  succeeds  in  retrieving  it.  Such  a  test  would  allow 
us  to  establish  some  measure  of  the  retrieval  effectiveness  of  the  auto¬ 
matic  elaboration  procedures.  We  can  compare  the  total  number  of 
documents  for  the  elaborated  requests  with  what  would  be  the  case  for 
the  unelaborated  request.  This  we  have  done  and  the  results  are  dis¬ 
cussed  in  the  following  section. 

3.  4  Some  Results 

Of  the  40  requests  that  were  made  the  answer  document  was  retrieved 
in  7.1  cases  and  it  was  not  retrieved  in  13  cases.  We  conducted  three 
different  types  of  elaborated  requests  for  each  of  the  40  cases.  The 
results  are  as  follows; 

1.  Using  the  method  of  request  elaboration  via  forward  conditional 
probabilities  between  index  tags,  we  retrieved  the  correct  an¬ 
swer  document  in  32  cases  out  of  the  40. 

2.  Elaborating  the  requests  via  the  inverse  conditional  probability 
heuristic  we  retrieved  the  correct  document  in  33  of  the  40 
cases. 

3.  Using  the  coefficient  of  association  to  obtain  the  elaborated  re¬ 
quest  we  obtained  success  in  33  cases  of  the  40. 
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Thus  we  see  that  the  automatic  elaboration  of  a  request  does,  in  fact, 
catch  relevant  documents  that  were  not  retrieved  by  the  original  request. 

We  now  raise  the  question:  "Because  of  the  small  size  of  the  library 
and  the  large  percent  of  the  total  library  that  is  selected  by  the  elaborated 
request,  are  the  above  results  statistically  significant?"  That  is  to  say, 
what  is  the  probability  of  doing  as  well  or  better  just  by  selecting  at 
random,  for  each  of  the  13  requests  fcr  which  the  answer  document  was 
not  originally  retrieved,  a  sample  of  size  equal  to  that  given  by  the 
elaborated  requests.  We  have  made  the  corresponding  calculations  and 
it  turns  out  that  probability  of  doing  as  well  or  better  by  chance  is  less 
than  0.  034  for  both  the  forward  and  inverse  conditional  probability  elab¬ 
orations  and  less  than  0.001  for  the  coefficient  of  association  search. 

Thus  the  above  results  are  indeed  statistically  significant. 

Could  the  number  of  answer  documents  have  been  improved;  i.  e.  ,  could 
40  out  of  40  answer  documents  have  been  retrieved.  We  looked  at  the 
seven  cases  for  which  the  answer  document  was  not  retrieved  when 
elaborating  via  the  coefficient  of  association  and  in  three  cases  the  in¬ 
dexing  was  at  fault.  That  is  to  say  in  three  of  the  seven  cases  the  an¬ 
swer  document  was  poorly  indexed  (a  fact  of  life  that  must  be  faced  by 
all  libraries).  In  one  case  the  request  formulation  was  very  poor  and 
no  reasonable  elaboration  would  help.  In  one  case  the  answer  document 
was  caught  by  a  different  heuristic  (viz.  ,  the  forward  conditional)  and 
in  the  remaining  two  cases,  again,  the  requests  suffered  by  being  poorly 
formulated 

Now  consider  the  fact  that,  although  the  automatic,  elaboration  of  a  re¬ 
quest  does  catch  relevant  documents  that  would  not  otherwise  have  been 
selected,  it  also  increases  the  total  number  of  retrieved  documents, 

(We  point  out  at  this  time  that  of  the  three  heuristics  which  we  considered, 
the  one  which  elaborated  via  the  coefficient  of  association  gave  the  greatest 
ratio  of  answer  documents  to  total  documents  retrieved,  )  In  order  to 
have  the  advantages  of  an  elaborated  request  (namely,  the  relevant 
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documents  that  it  obtains)  and  in  order  to  avoid  the  disadvantages 
(namely,  the  larger  number  of  total  documents)  we  now  introduce  the 
relevance  number  to  truncate  the  output  lists.  That  is  to  say,  we  use 
the  relevance  numbers  to  separate  out  the  highly  relevant  from  the  less 
relevant  documents,  by  adopting  the  following  rule;  Only  those  docu¬ 
ments  which  are  selected  by  the  elaborated  request  and  which  have  a 
standardized  relevance  number  greater  than  0.  5  are  to  be  retrieved. 

Our  experiments  with  the  coefficient  of  association  heuristic  show  that 
of  a  total  of  66  1  documents  that  were  selected  by  the  elaborated  requests 
only  446  (or  67.  5  percent)  have  a  standardized  relevance  number  greater 
than  0.5.^  Furthermore,  if  we  adopt  this  rule,  then  32  out  of  the  33 
(or  97  percent  of  the)  answer  documents  which  are  selected  by  the  auto¬ 
matic  elaboration  would  still  be  retrieved;  i.  e.  ,  32  of  the  33  answer 
documents  had  relevance  numbers  greater  than  0.  5. 

We  conclude  by  observing  that,  to  a  very  large  degree,  the  procedures 
for  automatically  elaborating  upon  a  request  are  empirical;  i.  e.  ,  their 
development  and  refinement  must  rest  on  further  empirical  testing  and 
experimentation.  Hopefully  the  results  of  further  tests  will  shed  light 
on  and  provide  new  insights  into  the  difficult  and  exciting  problems  of 
information  identification,  search  and  retrieval. 


For  these  computations  we  used  a  flat  a  priori  probability  distribution. 
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